Friday, April 26, 2013

MTEngine: latest developments

Here are the latest developments in our test environment, MTEngine_test:

1. We took a snapshot of sentences from opencorpora.org and are working on feeding them into a new UI feature called "tasks". Each task is one Russian sentence to be translated and rated by the user.
2. The free-form translation feature remains in the UI and has been moved into its own tab (screenshot in the Russian version of this message below).

For the production version of MTEngine we have made one improvement: when a user registers and uses the system for the first time, dictionary entries are looked up in the common dictionary contributed by all our users.

Happy translations!


Same in Russian:

Свежие разработки в тест версии проекта MTEngine:

1. Мы взяли дамп предложений проекта opencorpora.org и работаем над новой фичей под названием "задания". Каждое задание -- это одно предложение на русском языке для перевода и оценки пользователем.
2. Фича с произвольным переводом пользовательских предложений на русском языке будет находиться в отдельном табе:



Мы сделали улучшение и в продакшн версии: теперь, когда пользователь регистрируется и делает первые переводы, словарные единицы берутся из общего переводного словаря, который создали все пользователи проекта.

Успешных переводов и хороших выходных!


Friday, April 19, 2013

What grammatical challenges prevent Google Translate from being more effective?

Cross-posting my answer to the question in the title, which was asked on quora.com [1].

Google is pretty good at modeling pairs of languages that are close enough to each other. By close enough I mean languages that share a lot of vocabulary and have similar word order, a similar level of morphological richness and other grammatical features in common.

Let's pick an example of a pair where Google Translate (GT) is good. The round-trip method (translate a sentence into the other language and then back) is one way to verify whether the languages are close enough, at least statistically, for GT:

(These examples use GT only; no human interpretation is involved.)

English: I am in a shop.
Dutch: Ik ben in een winkel.
back to English: I'm in a store. (quite ok)

English: I danced into the room.
Dutch: Ik danste in de kamer.
back to English: I danced in the room. (preposition issue: "into" became "in", so the direction of motion is lost)
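
The round-trip check above can be sketched in a few lines of Python. This is only a minimal sketch: the `forward` and `backward` translators are hypothetical callables, and here they are stand-in lookup tables that replay the GT output quoted above rather than calling any real translation service. The word-overlap score is my own crude choice for illustration, not a standard MT metric.

```python
from typing import Callable, Set, Tuple

Translator = Callable[[str], str]


def round_trip(sentence: str, forward: Translator, backward: Translator) -> Tuple[str, str]:
    """Translate into the target language and back; return both results."""
    target = forward(sentence)
    return target, backward(target)


def words(sentence: str) -> Set[str]:
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,!?").lower() for w in sentence.split()}


def word_overlap(original: str, restored: str) -> float:
    """Jaccard overlap of the two word sets: a crude similarity score."""
    a, b = words(original), words(restored)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Stand-ins for a real MT system (assumption): they replay the output quoted above.
EN_NL = {"I danced into the room.": "Ik danste in de kamer."}
NL_EN = {"Ik danste in de kamer.": "I danced in the room."}

source = "I danced into the room."
target, restored = round_trip(source, EN_NL.__getitem__, NL_EN.__getitem__)
score = word_overlap(source, restored)
print(target, "|", restored, "|", round(score, 2))
```

A score below 1.0 only says that something changed on the way back; it takes a human (or a better metric) to notice that the lost word was the preposition carrying the direction of motion.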


Let's pick a pair of less closely related languages (by the way, when we claim that languages are unrelated grammatically, they may also be unrelated semantically or even pragmatically: different languages were created by people to suit their needs at particular moments in history). One such pair is English and Finnish:

Finnish: Hän on kaupassa.
English: He is in the shop.
back to Finnish: Hän on myymälä. (roughly the original Finnish sentence, although the case ending -ssa of kaupassa, meaning "in", is lost)

This example has the pronoun hän, which in Finnish is not gender-specific. It has to be resolved from a larger context than a single sentence: somewhere earlier in the text there should have been a mention of whom hän refers to.

To conclude this particular example: Google Translate translates at the sentence level, and that is a limitation in itself, one that makes correct pronoun resolution impossible. Pronouns matter if we want to understand how the objects in a text interact.
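
To make the point about context concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the tiny `KNOWN_REFERENTS` lexicon, the function names and the "most recent mention wins" heuristic are mine, and this is not how GT or any real coreference resolver works. It only shows that the information needed to choose between "he" and "she" lives outside the sentence being translated.

```python
from typing import List, Optional

# Hypothetical mini-lexicon (assumption): names mapped to English pronouns.
KNOWN_REFERENTS = {"Liisa": "She", "Matti": "He"}


def resolve_han(context: List[str]) -> Optional[str]:
    """Return the pronoun of the most recently mentioned known name, if any."""
    for sentence in reversed(context):
        tokens = [w.strip(".,!?") for w in sentence.split()]
        for token in reversed(tokens):
            if token in KNOWN_REFERENTS:
                return KNOWN_REFERENTS[token]
    return None


def translate_han_on_kaupassa(context: List[str]) -> str:
    """Translate 'Hän on kaupassa.' using whatever preceding context is given."""
    pronoun = resolve_han(context) or "He/She"
    return f"{pronoun} is in the shop."


# Sentence-level translation: no context, so the pronoun stays ambiguous.
print(translate_han_on_kaupassa([]))
# Document-level translation: an earlier sentence mentions Liisa.
print(translate_han_on_kaupassa(["Liisa lähti ostoksille."]))
```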


Let's pick another example of unrelated languages: English and Russian.

Russian: Маска бывает правдивее и выразительнее лица.
English: The mask is truthful and expressive face. (should have been: The mask can be more truthful and more expressive than the face)
back to Russian: Маска правдивым и выразительным лицом. (hard to translate, but roughly: The mask being a truthful and expressive face)

To conclude this example: languages with rich morphology, such as Russian, convey grammatical case through word inflection alone and thus require deeper grammatical analysis, which purely statistical machine translation methods lack no matter how much data has been acquired. There are methods that combine rules and statistics.


Another pair and a different example:
English: Reporters said that IBM has bought Lotus.
Japanese: 記者は、IBMがロータスを買っていると述べた。
back to English: The reporter said that IBM Lotus are buying.

Japanese has a "recursive syntax" that renders this English sentence roughly as:

Reporters (IBM Lotus has bought) said that.

i.e. the verb is placed syntactically after the subject and the (direct / indirect) object of a sentence or of an embedded clause.

To conclude this example: there should exist a method for mapping syntactic structures as larger units of language, and that mapping should be done in a more controlled fashion (i.e. it is hard to derive from pure statistics).
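
As a rough illustration of what mapping syntax structures "as larger units" could mean, here is a small Python sketch. The `Clause` type and both functions are hypothetical names of mine, and the clause structure is entered by hand rather than produced by a parser. The sketch only shows the reordering step: once the structure is known, the verb-final order falls out of one recursive rule.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Clause:
    """A toy clause: subject, verb, and an object that may itself be a clause."""
    subject: str
    verb: str
    obj: Union[str, "Clause"]


def to_svo(clause: Clause) -> str:
    """Linearize in English-like subject-verb-object order."""
    obj = f"({to_svo(clause.obj)})" if isinstance(clause.obj, Clause) else clause.obj
    return f"{clause.subject} {clause.verb} {obj}"


def to_sov(clause: Clause) -> str:
    """Linearize in Japanese-like subject-object-verb order, recursively."""
    obj = f"({to_sov(clause.obj)})" if isinstance(clause.obj, Clause) else clause.obj
    return f"{clause.subject} {obj} {clause.verb}"


sentence = Clause("Reporters", "said that", Clause("IBM", "has bought", "Lotus"))
print(to_svo(sentence))  # Reporters said that (IBM has bought Lotus)
print(to_sov(sentence))  # Reporters (IBM Lotus has bought) said that
```

The hard part, of course, is getting from a raw sentence to such a structure in the first place; the reordering itself is trivial once the clause boundaries are known.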


References
[1] http://www.quora.com/Linguistics/What-grammatical-challenges-prevent-Google-Translate-from-being-more-effective