Tuesday, October 24, 2017

What grammatical challenges prevent Google Translate from being more effective?

Here is one more Quora question on the exciting topic of machine translation and my answer to it.

The question had some sub-questions:

  • Is there a set of broad grammatical rules which decreases its efficacy?
  • How can these challenges be overcome? Is it possible to fully automate good quality translation?

Below is my answer; I hope it offers an interesting look at machine translation across different language pairs. Note that the translations Google Translate gives today might differ from those below, which were obtained in 2013. UPD: and they do! See the comments to this post.

Google is pretty good at modeling language pairs that are close enough. By close enough I mean languages that share much of their vocabulary and have similar word order, a similar level of morphological richness and other grammatical features in common.

Let's pick an example of a pair where Google Translate (GT) is good. The round-trip method (translate a sentence into the other language and then back again) is one way to verify whether the languages are close enough, at least statistically, for GT:

(these examples are using GT only, no human interpretation involved)

English: I am in a shop.
Dutch: Ik ben in een winkel.
back to English: I'm in a store. (quite ok)

English: I danced into the room.
Dutch: Ik danste in de kamer.
back to English: I danced in the room. (preposition issues)
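The round-trip check used above can be sketched in a few lines of Python. This is only an illustration: `round_trip`, `similarity` and `canned_translate` are hypothetical names, the "translator" is a lookup table holding the GT outputs quoted above rather than a call to any real translation API, and token overlap is just one crude way to compare the original with its round trip.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple

# A translator is any function (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]

# Stand-in for a real MT system: the outputs quoted in the examples above.
CANNED = {
    ("en", "nl", "I am in a shop."): "Ik ben in een winkel.",
    ("nl", "en", "Ik ben in een winkel."): "I'm in a store.",
    ("en", "nl", "I danced into the room."): "Ik danste in de kamer.",
    ("nl", "en", "Ik danste in de kamer."): "I danced in the room.",
}


def canned_translate(text: str, src: str, tgt: str) -> str:
    """Look up a stored translation (raises KeyError for unknown input)."""
    return CANNED[(src, tgt, text)]


def round_trip(text: str, src: str, pivot: str,
               translate: Translator) -> Tuple[str, str]:
    """Translate src -> pivot -> src and return both intermediate results."""
    forward = translate(text, src, pivot)
    back = translate(forward, pivot, src)
    return forward, back


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; a crude proxy, not a quality metric."""
    tokens_a = a.lower().rstrip(".").split()
    tokens_b = b.lower().rstrip(".").split()
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()


for sentence in ("I am in a shop.", "I danced into the room."):
    forward, back = round_trip(sentence, "en", "nl", canned_translate)
    print(sentence, "->", forward, "->", back,
          f"(overlap {similarity(sentence, back):.2f})")
```

Note that surface overlap can mislead in both directions: "I'm in a store" is a fine round trip but shares few tokens with the original, while "in the room" differs from "into the room" by a single token yet changes the meaning. A human judgement, as in the remarks in parentheses above, is still needed.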

Let's pick a pair of less closely related languages (by the way, when we claim that languages are unrelated grammatically, they may also be unrelated semantically or even pragmatically: different languages were created by people to suit their needs at particular moments of history). One such pair is English and Finnish:

Finnish: Hän on kaupassa.
English: He is in the shop.
back to Finnish: Hän on myymälä. (roughly the original Finnish sentence)

This example has the pronoun hän, which in Finnish is not gender specific. It has to be resolved from a larger context than a single sentence: somewhere earlier in the text there should have been a mention of who hän refers to.

To conclude this particular example: Google Translate translates at the sentence level, and that is a limitation in itself, one that makes correct pronoun resolution impossible. Resolving pronouns matters whenever we want to understand how the entities in a text interact.

Let's pick another example of unrelated languages: English and Russian.

Russian: Маска бывает правдивее и выразительнее лица.
English: The mask is truthful and expressive face. (should have been: The mask can be more truthful and expressive than the face)
back to Russian: Маска правдивым и выразительным лицом. (hard to translate, but the meaning is roughly: The mask being a truthful and expressive face)

To conclude this example: languages with rich morphology, such as Russian, convey grammatical case through word inflection alone and thus require deeper grammatical analysis, which purely statistical machine translation methods lack no matter how much data has been acquired. There exist methods that combine rules and statistics.

Another pair and a different example:
English: Reporters said that IBM has bought Lotus.
Japanese: 記者は、IBMがロータスを買っていると述べた。
back to English: The reporter said that IBM Lotus are buying.

Japanese has a "recursive syntax" that renders this English sentence roughly as:

Reporters (IBM Lotus has bought) said that.

i.e. the verb is syntactically placed after the subject-object pair of a sentence or a sub-sentence (direct / indirect object).

To conclude this example: there should exist a method of mapping syntax structures as larger units of the language, and that should be done in a more controlled fashion (i.e. it is hard to derive from pure statistics).
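To make the idea of mapping syntax structures a bit more concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how GT or any real system works: the `Clause` class and the `linearize` function are hypothetical, the tree is written by hand instead of coming from a parser, and the complementizer "that" is left out. It only shows that once a sentence is held as a nested structure, the verb-final order falls out of one recursive rule.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Clause:
    """A toy clause: subject, verb, and an object that may itself be a clause."""
    subject: str
    verb: str
    obj: Union[str, "Clause"]


def linearize(node: Union[str, Clause], order: str) -> List[str]:
    """Flatten a clause tree into words using a constituent order such as
    "SVO" (English-like) or "SOV" (Japanese-like), applied recursively."""
    if isinstance(node, str):
        return [node]
    slots = {"S": node.subject, "V": node.verb, "O": node.obj}
    words: List[str] = []
    for slot in order:
        words.extend(linearize(slots[slot], order))
    return words


# Hand-built structure for "Reporters said that IBM has bought Lotus."
sentence = Clause("Reporters", "said", Clause("IBM", "has bought", "Lotus"))

print(" ".join(linearize(sentence, "SVO")))
print(" ".join(linearize(sentence, "SOV")))
```

With the "SOV" order the embedded clause is emitted in full before the main verb, which mirrors the bracketed rendering "Reporters (IBM Lotus has bought) said" above. The hard part in practice is obtaining a reliable tree in the first place.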


Ted Dunning ... apparently Bayesian said...

Google translate now gives "The mask is more truthful and expressive than the face". This is more emphatic than the original. The round trip goes to "Маска более правдивая и выразительная, чем лицо" which similarly lacks the uncertainty of the original.

Overall, however, much better than your example. This is the problem of systems that are continuously improving.

Dmitry Kan said...

Hi Ted,

Glad to see you on the blog.

Yes, I got a similar comment over at g+ that GT has improved, apparently by switching from a statistics-based method to neural MT. It is pretty impressive.

I just thought that posting as many of these examples as possible should help improve GT, or even just serve as quality anchors.