About the translation algorithm

A brief introduction to our machine translation algorithm

We have implemented statistical machine translation (SMT). SMT is completely data-driven: it works by calculating word and phrase probabilities from a large corpus. We have used OPUS and Wiktionary as our primary sources.

Data models

From the data sources (mostly bilingual parallel corpora) a dictionary of translations is constructed. For each translation we keep a count and part-of-speech (POS) tags for both source and target. Together these form our translation model and POS models, and they look something like this:

Translation & POS models
source word|source POS tags|translation count|target word|target POS tags
om|conj|12|if|conj
om|adv|7|again|adj
övermorgon|adv|3|the day after tomorrow|det noun prep noun
...
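
In code, the translation model is essentially a lookup table from a source word or phrase to all of its observed translations. A minimal sketch in Python, purely illustrative (the names and layout are assumptions, not the actual implementation):

from collections import defaultdict
from typing import NamedTuple

class TranslationEntry(NamedTuple):
    source_pos: str   # POS tags of the source word or phrase
    count: int        # how often this translation was observed
    target: str       # target word or phrase
    target_pos: str   # POS tags of the target

# The translation model maps a source word or phrase to its observed translations.
translation_model = defaultdict(list)
translation_model["om"].append(TranslationEntry("conj", 12, "if", "conj"))
translation_model["om"].append(TranslationEntry("adv", 7, "again", "adj"))
translation_model["övermorgon"].append(
    TranslationEntry("adv", 3, "the day after tomorrow", "det noun prep noun"))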

For the target language, a language model and a grammar model are used. Each consists of n-grams of length 1 to 5. The language model maps word sequences to frequencies; the grammar model maps sequences of POS tags to frequencies:

Language model
phrase|count
hello world|493920
hi world|19444
...
Grammar model
POS tags|count
prep noun|454991
prep prep|3183
...
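
Both models boil down to n-gram count tables from which relative frequencies can be computed. A rough Python sketch using the sample rows above (again just an illustration of the idea, not the real storage format):

# Language model: n-gram of words -> observed count
language_model = {
    ("hello", "world"): 493920,
    ("hi", "world"): 19444,
}

# Grammar model: n-gram of POS tags -> observed count
grammar_model = {
    ("prep", "noun"): 454991,
    ("prep", "prep"): 3183,
}

def ngram_probability(ngram, model):
    # Relative frequency among n-grams of the same length.
    total = sum(count for key, count in model.items() if len(key) == len(ngram))
    return model.get(ngram, 0) / total if total else 0.0

print(ngram_probability(("hello", "world"), language_model))  # ~0.96
print(ngram_probability(("prep", "prep"), grammar_model))     # ~0.007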

Building a graph

So we have data. Plenty of data. Now we just need to make use of it. When a text is translated, a graph is built between all possible translations. Most of the time each word has multiple translations and meanings, so the number of combinations grows very quickly. During graph building we also need to remember that source phrases can contract, e.g. 'i morgon' => 'tomorrow', and expand, e.g. 'övermorgon' => 'the day after tomorrow'.

We look at a maximum of 5 words. Once the graph is built, a traversal is initiated. As we traverse the graph, the encountered sub-phrases are scored and the best path is chosen.

Graph for 'hej världen!'
hej       världen       !
--------------------------
Translations:
hi        world         !
hello     universe
howdy     earth
hey

Combinations:
hi        world         !
hi        universe      !
hi        earth         !
hello     world         !
hello     universe      !
hello     earth         !
...
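
To get a feel for how quickly the combinations grow, here is a toy Python sketch that enumerates the paths for 'hej världen !' exactly as listed above. A real graph build also has to handle the contractions and expansions mentioned earlier, which this sketch ignores:

from itertools import product

# Candidate translations per source token, taken from the example above.
candidates = [
    ["hi", "hello", "howdy", "hey"],  # hej
    ["world", "universe", "earth"],   # världen
    ["!"],                            # !
]

combinations = list(product(*candidates))
print(len(combinations))  # 4 * 3 * 1 = 12 paths, even for this tiny sentence
for combo in combinations[:3]:
    print(" ".join(combo))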

Unfortunately there is no way to examine all translations exhaustively, so we need to traverse the graph intelligently. We use a beam search with limited depth and width to get the search space down to a manageable size.
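
The pruning idea behind the beam search, sketched in Python. Here score_partial is a hypothetical stand-in for the phrase scoring described in the next section, and for simplicity only the width limit is shown (the real traversal also limits depth):

def beam_search(candidates, score_partial, beam_width=10):
    # candidates    : one list of candidate translations per source position
    # score_partial : scores a partial translation, higher is better
    beam = [([], 0.0)]  # (partial translation, score)
    for options in candidates:
        expanded = []
        for partial, _ in beam:
            for option in options:
                new_partial = partial + [option]
                expanded.append((new_partial, score_partial(new_partial)))
        # Prune: keep only the beam_width best partial translations.
        beam = sorted(expanded, key=lambda pair: pair[1], reverse=True)[:beam_width]
    return beam[0]  # best full translation and its score

With a fixed beam width the number of scored sub-phrases grows linearly with sentence length instead of exponentially, which is what makes the traversal tractable.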

Scoring phrases

The scoring of each phrase combines the four models described above:

Translation model: This is the dictionary, source -> target. Each entry has a frequency, and from the frequency we can calculate a probability (p1): "the most likely translation for the word 'hej' is 'hello'".

Source grammar model: The POS tags help us resolve ambiguity. A probability (p2) is calculated, basically saying "'hej'/'hello' is likely an interjection".

Target language model: We look at 1- to 5-grams. An n-gram is a sequence of words; for example, "hello world" is a 2-gram. Each n-gram has a frequency indicating how common it is. Again a probability (p3) can be calculated: "the sequence 'hello world' is more likely than 'hi world'".

Target grammar model: Just like the language model, but with POS tags. A probability (p4) is calculated, indicating things like "a verb followed by a preposition sounds better than two prepositions in a row".

We use a sliding window moving over the phrase, combining probabilities with the chain rule into accumulated P1-P4. We end up with four parameters that are finally mixed with different weights according to

score = P1^w1 * P2^w2 * P3^w3 * P4^w4

Working in log space makes life easier here, since the products become sums. Then we just select the phrase with the highest score.
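
A sketch of that combination step in Python, assuming the per-window probabilities p1-p4 have already been looked up in the models above and are non-zero:

import math

# Weights in the same order as P1..P4, roughly as estimated below.
WEIGHTS = (1.0, 0.6, 0.3, 0.05)

def log_score(window_probs, weights=WEIGHTS):
    # window_probs: one (p1, p2, p3, p4) tuple per sliding-window position.
    # Chain rule in log space: the product per model becomes a sum, and the four
    # accumulated log-probabilities are then mixed with their weights:
    # log(score) = w1*log(P1) + w2*log(P2) + w3*log(P3) + w4*log(P4)
    accumulated = [0.0, 0.0, 0.0, 0.0]
    for probs in window_probs:
        for i, p in enumerate(probs):
            accumulated[i] += math.log(p)
    return sum(w * a for w, a in zip(weights, accumulated))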

We estimate the weights (w1-w4) with a randomized search that tries to maximize the BLEU score on a test set. The estimation only needs to be rerun when the training data changes. As expected, the highest weight is assigned to the translation model (w1=1), the second highest to the source grammar model (w2~0.6), the third highest to the language model (w3~0.3), and the lowest to the target grammar model (w4~0.05). Yes, as it turns out the target grammar model is not very important; it helps resolve uncertainty in some cases by predicting POS tags, but I might actually nuke it in favor of simplicity in future versions.
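
The weight estimation itself can be sketched as a plain random search. The evaluate_bleu function below is a hypothetical stand-in that would translate the test set with the given weights and return its BLEU score; fixing w1 to 1.0 is also just an assumption of this sketch, since only the relative size of the weights matters for ranking:

import random

def tune_weights(evaluate_bleu, iterations=1000, seed=42):
    # Randomized search for the weights w1-w4 that maximize BLEU on a test set.
    rng = random.Random(seed)
    best_weights, best_bleu = None, float("-inf")
    for _ in range(iterations):
        weights = (1.0, rng.uniform(0, 1), rng.uniform(0, 1), rng.uniform(0, 1))
        bleu = evaluate_bleu(weights)
        if bleu > best_bleu:
            best_weights, best_bleu = weights, bleu
    return best_weights, best_bleu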

There were plenty of unmentioned problems to be solved along the way, but you get the overall idea. One thing that easily puts you off is the size of the data you are dealing with, e.g. downloading and processing TB-sized datasets like the Google n-grams. At one point, after 4 days of processing those huge zip files, Windows Update decided to restart the computer…