Today I played around with Google's online MT system for some experiments, and it seems to produce surprisingly good results! Let's look at an example (English to Chinese):
"An international force must be swiftly sent to Lebanon, the US president says, after talks with Tony Blair."
One of the leading offline Chinese-English MT systems (AIT金山快译) produced:
"一个国际的力量一定很快地被送到黎巴嫩, 在和汤尼.布莱尔的谈话之后美国籍的总统说。"
Google produced:
"一支国际部队必须迅速送往黎巴嫩,美国总统说,在与布莱尔的会谈. "
Now, I believe we all agree that Google's translation is much better, and people will probably be amazed by its
"intelligent" statistical/rule-based algorithm. Well... is it intelligent? Let's investigate a bit more:
I fed Google the following sentence: "HONG KONG, July 1 (Xinhua)".
Now, before I reveal the answer, think: how would you translate it? Probably "香港7月1日,(新华)", right?
And that assumes you already know that Xinhua means 新华 or 新华社.
OK, so what does Google produce? It is
"新华社香港7月1日电(记者罗辉) "
Now, isn't it amazing that "HONG KONG, July 1 (Xinhua)" is translated perfectly into 新华社香港7月1日电?
What bothered me is the part in parentheses: where does (记者罗辉) come from??? It is a byline, "reporter Luo Hui", and nothing in the input mentions a reporter.
There is no way one can produce such a translation by any pure statistical method. It must be a template, a rule, that
tells Google that HONG KONG, July 1 (Xinhua) means 新华社香港7月1日电(记者罗辉), but how?
My guess is that Google uses some sort of parallel corpus, so it already knew the sentence pair.
Now this raises an interesting question: if we could collect every possible phrase or sentence pair, we could make a good
translation system, but would it still count as "intelligent", or even as "artificial intelligence"?
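To make that guess concrete, here is a minimal sketch of what "knowing the sentence pair" could look like: an exact-match lookup with a fallback. It is purely illustrative and says nothing about how Google's system actually works. The `%memory` table, the `translate_with_memory` function and the stub fallback are hypothetical names of my own; the one stored pair is the one observed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # this source file contains literal Chinese text

# Hypothetical "translation memory": source sentences mapped to stored
# translations, as if harvested from a parallel corpus.
my %memory = (
    'HONG KONG, July 1 (Xinhua)' => '新华社香港7月1日电(记者罗辉)',
);

# Return the stored translation when the exact sentence is known;
# otherwise hand the sentence to a fallback translator.
sub translate_with_memory {
    my ($sentence, $fallback) = @_;
    return $memory{$sentence} if exists $memory{$sentence};
    return $fallback->($sentence);
}

# Stub fallback: a real system would run its general-purpose model here.
my $fallback = sub { my ($s) = @_; return "[no stored pair] $s" };

binmode(STDOUT, ':encoding(UTF-8)');
print translate_with_memory('HONG KONG, July 1 (Xinhua)', $fallback), "\n";
print translate_with_memory('HONG KONG, July 2 (Xinhua)', $fallback), "\n";
```

The first call returns the stored pair, byline included, while changing a single digit in the second call falls through to the fallback. A lookup like this can return text that is not in the input at all, which is exactly what the (记者罗辉) output looks like.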
The following is a simple Perl program that exploits Google's translation service: feed it sentences and it "grabs" the results from the web.
It works on Windows, but on Unix/Linux the UTF-8 output is wrong and I am not quite sure why. If you know how to solve it, please tell me.
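Before the script itself, one thing that may be worth checking for the UTF-8 problem. This is a guess, not a diagnosis. Perl distinguishes raw bytes from decoded character strings, and mixing the two on output is a common source of garbled text. The sketch below uses made-up sample bytes in place of a real HTTP response; it decodes them explicitly and puts an explicit encoding layer on STDOUT.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Sample data standing in for the raw body of an HTTP response:
# the UTF-8 bytes for the two characters 新华.
my $bytes = "\xE6\x96\xB0\xE5\x8D\x8E";

# Decode bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);
print length($bytes), " bytes, ", length($chars), " characters\n";

# Tell Perl to encode characters as UTF-8 whenever it writes to STDOUT.
binmode(STDOUT, ':encoding(UTF-8)');
print $chars, "\n";
```

With LWP, `$response->decoded_content` returns an already decoded character string, whereas `$response->content` returns the raw bytes. Whether this is really what goes wrong on Unix/Linux I cannot say without seeing the garbled output.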
#! /usr/bin/perl -w
#my $langpair = "en|de"; #English to German
#my $langpair = "en|es"; #English to Spanish
#my $langpair = "en|fr"; #English to French
#my $langpair = "en|it"; #English to Italian
#my $langpair = "en|pt"; #English to Portuguese
#my $langpair = "en|ar"; #English to Arabic BETA
#my $langpair = "en|ja"; #English to Japanese BETA
#my $langpair = "en|ko"; #English to Korean BETA
my $langpair = "en|zh-CN"; #English to Chinese (Simplified) BETA
#my $langpair = "de|en"; #German to English
#my $langpair = "de|fr"; #German to French
#my $langpair = "es|en"; #Spanish to English
#my $langpair = "fr|en"; #French to English
#my $langpair = "fr|de"; #French to German
#my $langpair = "it|en"; #Italian to English
#my $langpair = "pt|en"; #Portuguese to English
#my $langpair = "ar|en"; #Arabic to English BETA
#my $langpair = "ja|en"; #Japanese to English BETA
#my $langpair = "ko|en"; #Korean to English BETA
#my $langpair = "zh-CN|en"; #Chinese (Simplified) to English BETA
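use LWP 5.64; # Loads all important LWP classes, and makes
# sure your version is reasonably recent.
use URI::Escape; # provides uri_escape
my $browser = LWP::UserAgent->new;
# Browser-like request headers; these values are placeholders, adjust as needed.
my @ns_headers = ('User-Agent' => 'Mozilla/4.76 [en] (Win98; U)', 'Accept-Charset' => 'utf-8');
my $text = "";
while(<>){
$text = $_ ;
# URL-escape the input so spaces, quotes and '&' survive in the query string
chomp(my $query = $text);
$query = uri_escape($query);
# get result from google translation
my $url = "http://translate.google.com/translate_t?hl=en&ie=UTF8&oe=UTF8&langpair=$langpair&text=$query";
# fetch the page with the browser-like headers defined above
my $response = $browser->get($url, @ns_headers);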
die "Can't get $url -- ", $response->status_line
unless $response->is_success;
die "Hey, I was expecting HTML, not ", $response->content_type
unless $response->content_type eq 'text/html';
# or whatever content-type you're equipped to deal with
# Otherwise, process the content somehow:
$str = $response->content;