Friday, April 19, 2013

What grammatical challenges prevent Google Translate from being more effective?

Cross-posting my answer to the question in the title, originally posted on [1].

Google is pretty good at modeling language pairs that are close enough. By close enough I mean languages that share a lot of vocabulary and have similar word order, a similar level of morphological richness and other grammatical features in common.

Let's pick an example of a pair where Google Translate (GT) does well. The round-trip method (translate a sentence into the other language and then back again) is one way to verify whether two languages are close enough, at least statistically, for GT:

(these examples use GT only; no human interpretation is involved)

English: I am in a shop.
Dutch: Ik ben in een winkel.
back to English: I'm in a store. (quite OK)

English: I danced into the room.
Dutch: Ik danste in de kamer.
back to English: I danced in the room. (preposition issue: "into" became "in", so the sense of motion is lost)
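
The round-trip check above can be sketched in a few lines of Python. This is only an illustration: `canned_translate` is a hypothetical stand-in that replays the GT outputs quoted above rather than calling any real translation API, and `token_overlap` is a deliberately crude similarity measure assumed here for the example.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for a translation service: it only replays the
# Google Translate outputs quoted in this post. A real check would call
# an actual MT system here.
CANNED = {
    ("en", "nl", "I am in a shop."): "Ik ben in een winkel.",
    ("nl", "en", "Ik ben in een winkel."): "I'm in a store.",
    ("en", "nl", "I danced into the room."): "Ik danste in de kamer.",
    ("nl", "en", "Ik danste in de kamer."): "I danced in the room.",
}


def canned_translate(text: str, src: str, tgt: str) -> str:
    return CANNED[(src, tgt, text)]


def round_trip(
    text: str, src: str, pivot: str, translate: Callable[[str, str, str], str]
) -> Tuple[str, str]:
    """Translate src -> pivot -> src and return both intermediate results."""
    forward = translate(text, src, pivot)
    back = translate(forward, pivot, src)
    return forward, back


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets (full stops removed)."""
    words_a = set(a.lower().replace(".", "").split())
    words_b = set(b.lower().replace(".", "").split())
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    for sentence in ("I am in a shop.", "I danced into the room."):
        forward, back = round_trip(sentence, "en", "nl", canned_translate)
        print(sentence, "|", forward, "|", back, "|", token_overlap(sentence, back))
```

Note that a surface metric like this scores the first example (an acceptable paraphrase) lower than the second (a real change in meaning), so the round trip still needs a human to judge the result.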

Let's pick a pair of less related languages. (By the way, when we say that languages are unrelated grammatically, they may also be unrelated semantically or even pragmatically: different languages were shaped by people to suit their needs at particular moments in history.) One such pair is English and Finnish:

Finnish: Hän on kaupassa.
English: He is in the shop.
back to Finnish: Hän on myymälä. (roughly the original Finnish sentence)

This example contains the pronoun hän, which in Finnish is not gender-specific. Its gender has to be resolved from a larger context than a single sentence: somewhere earlier in the text there should have been a mention of who hän refers to.

To conclude this example: Google Translate works at the sentence level, and that is a limitation in itself, because it makes correct pronoun resolution impossible. Pronouns matter if we want to understand how the entities in a text interact.

Let's pick another example of unrelated languages: English and Russian.

Russian: Маска бывает правдивее и выразительнее лица.
English: The mask is truthful and expressive face. (should have been: The mask can be more truthful and expressive than the face)
back to Russian: Маска правдивым и выразительным лицом. (hard to translate, but the meaning is roughly: The mask being a truthful and expressive face)

To conclude this example: languages with rich morphology, such as Russian, convey grammatical case through word inflection alone and thus require deeper grammatical analysis, which purely statistical machine translation methods lack no matter how much data has been acquired. Methods that combine rules and statistics do exist.

Another pair and a different example:

English: Reporters said that IBM has bought Lotus.
Japanese: 記者は、IBMがロータスを買っていると述べた。
back to English: The reporter said that IBM Lotus are buying.

Japanese has a "recursive syntax" that represents this English sentence roughly as:

Reporters (IBM Lotus has bought) said that.

i.e. the verb is syntactically placed after the subject-object pair of a sentence or a sub-sentence (direct / indirect object).

To conclude this example: there should be a method for mapping syntactic structures as larger units of the language, and this should be done in a more controlled fashion (i.e. it is hard to derive from pure statistics).
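
As a toy illustration of what mapping a syntax structure could look like, the sketch below stores the example sentence as a small tree and linearizes it in two word orders. The `Clause` class and the `svo` / `sov` functions are hypothetical names made up for this example; treating "said that" as a single verb string is a simplification, and no real parser or MT system works on input this clean.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Clause:
    subject: str
    verb: str
    obj: Union[str, "Clause"]  # the object may itself be a clause


def svo(clause: Clause) -> str:
    """Linearize in English-like subject-verb-object order."""
    obj = svo(clause.obj) if isinstance(clause.obj, Clause) else clause.obj
    return f"{clause.subject} {clause.verb} {obj}"


def sov(clause: Clause) -> str:
    """Linearize verb-finally, bracketing embedded clauses."""
    obj = f"({sov(clause.obj)})" if isinstance(clause.obj, Clause) else clause.obj
    return f"{clause.subject} {obj} {clause.verb}"


# "said that" is kept as one unit to mirror the gloss above.
sentence = Clause("Reporters", "said that", Clause("IBM", "has bought", "Lotus"))

if __name__ == "__main__":
    print(svo(sentence))  # Reporters said that IBM has bought Lotus
    print(sov(sentence))  # Reporters (IBM Lotus has bought) said that
```

The point of the sketch is that the reordering is applied to the structure as a whole, recursively, rather than word by word.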

