The Real Problems with Neural Machine Translation

TLDR: No! Your Machine Translation Model is not “prophesying”, but let’s look at the six major issues with neural machine translation (NMT).

So I saw a Twitter thread today in which the editor-in-chief of Motherboard tweeted, “Google Translate is popping out bizarre religious texts and no one is sure why”. This post is in response to that. I am going to spend a little time on the “why” part (folks who work in MT know why), but mostly focus on actual problems with neural machine translation.

But first, a small digression on responsible AI reporting. The choice of headline, the promotional tweet, and the tone of the article remind me of all the irresponsible writing that went around the famous “Facebook Frankenstein” experiment.

I would not be surprised if other media outlets picked up this Motherboard piece and ran ridiculous stories about machine translation conspiracy theories. If you are a journalist writing about AI, using a sensational lede to drive engagement to your website is the worst thing you can do: most people just read the lede and share it on social networks with “ZOMG! AI is going to kill us!” That distracts us from the actual harms of current-day machine learning, like bias, discrimination, and the weaponization of AI. Ok. Enough ranting; back to the technical discussion …

So why religious texts in the outputs? When it comes to parallel corpora, religious texts are the lowest common denominator resource. Major religious texts like the Bible and the Koran exist in many (all?) languages. Back in graduate school, I used a “Dianetics parallel corpus” for a project, simply because, like the Bible, all the proselytizing has gotten it translated into more than 50 languages, including some low-resource ones. Say you are bootstrapping an Urdu-to-English machine translation system for the government: it is easy to put together a bunch of religious texts already translated into Urdu. So one can safely assume Google’s parallel corpus collection contains all the major religious texts, and for many low-resource languages they make up a non-trivial portion of the training corpus.

One explanation of why we see religious text in the output specifically for garbage inputs, especially in low-resource language pairs (Somali-to-English in the figure above): religious texts contain a lot of rare words that don’t occur anywhere else but in religious contexts. So rare words could trigger religious contexts in the decoder, especially when the proportion of such texts in the training data is large. Another explanation is that the model has little statistical support for the input, and the output is simply gobbledygook sampled from the decoder model.
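The second explanation is easier to see with a toy sketch. The snippet below is a deliberately tiny, purely illustrative bigram “decoder” (nothing like a real NMT model), built here only to make the point: if the input is effectively ignored, which is what happens when the encoder provides no usable signal, greedy generation still produces fluent-looking text drawn from whatever dominated training.

```python
# Toy illustration (not a real NMT decoder): when the input provides no
# usable signal, generation is driven almost entirely by the decoder's
# language-model prior over whatever text dominated training.
from collections import Counter, defaultdict

training_text = (
    "in the beginning was the word and the word was with god "
    "and the word was god"
).split()

# Build a bigram "decoder prior" from the training text.
bigrams = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    bigrams[prev][nxt] += 1

def generate(seed, steps=10):
    """Greedy generation: always pick the most frequent continuation."""
    out = [seed]
    for _ in range(steps):
        if out[-1] not in bigrams:
            break
        out.append(bigrams[out[-1]].most_common(1)[0][0])
    return " ".join(out)

# The "garbage input" is ignored entirely; fluent training-like text comes
# out anyway, because that is all the model knows how to say.
print(generate("the"))
```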

So what are some real problems in neural machine translation today?

Philipp Koehn and Rebecca Knowles wrote a fantastic paper on this topic, “Six Challenges for Neural Machine Translation” (WNMT 2017), that’s still relevant now. To summarize:

1. NMT sucks with out-of-domain data: Current machine translation systems produce very fluent outputs unrelated to the input when given out-of-domain data. So a general MT system like Google Translate will do particularly badly in specialized domains like legal or finance. NMT systems degrade especially sharply here compared to traditional methods like phrase-based systems. How bad? See the figure below from the paper. The off-diagonal elements are out-of-domain results; the green bars are NMT and the blue bars are phrase-based systems.

MT systems trained on one domain (rows) and tested on another domain (cols). Blue: Phrase-based, Green: NMT.
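If you want to quantify this gap for your own system, a quick sketch with the sacrebleu library looks like the following. The file names are placeholders for hypothesis/reference pairs you would have produced yourself; nothing here is specific to the paper’s setup.

```python
# Sketch: measuring the in-domain vs. out-of-domain BLEU gap with sacrebleu
# (pip install sacrebleu). The file names below are placeholders.
import sacrebleu

def corpus_bleu_from_files(hyp_path, ref_path):
    """Read one hypothesis and one reference per line and compute corpus BLEU."""
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        hyps = [line.strip() for line in h]
        refs = [line.strip() for line in r]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

# A system trained on, say, news data ...
print("in-domain (news) BLEU:", corpus_bleu_from_files("news.hyp", "news.ref"))
# ... typically drops sharply on a legal or medical test set.
print("out-of-domain (legal) BLEU:", corpus_bleu_from_files("legal.hyp", "legal.ref"))
```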

2. NMT sucks with small datasets: While this can be said about most machine learning in general, the problem is especially pronounced in NMT. The promise of NMT is that it generalizes better (than phrase-based MT) as data grows, but in low-data situations NMT does significantly worse. In fact, to quote the authors, “Under low resource conditions, NMT produces fluent output unrelated to the input.” This could be another reason for some of the NMT weirdness explored in the Motherboard article.

3. Subword NMT sucks for rare words: While still better than phrase-based translation, NMT does poorly on rare or unseen words. This becomes a problem for highly inflected languages and for domains with a lot of named entities, where individual inflected forms and named entities can be very rare.

Figure excerpt from Chapter 2 of our upcoming book. In a highly inflected language like Turkish, for example, it is quite common for an individual inflected form to occur only rarely.

Words that were observed only once in training effectively get dropped (or mapped to an unknown token). Techniques like byte-pair encoding help, but more detailed work on this is warranted.
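For context, here is a minimal sketch of the byte-pair-encoding merge loop in the spirit of Sennrich et al. (2016), run on a toy word-frequency table. In practice you would use a library like subword-nmt or sentencepiece; this is only meant to show how rare words end up as sequences of frequent subword units.

```python
# Minimal sketch of BPE merge learning on a toy vocabulary (illustrative only).
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the chosen symbol pair wherever it occurs as whole symbols."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are pre-split into characters, with </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    vocab = merge_vocab(best, vocab)
    print("merged", best)

print(vocab)  # rare words are now sequences of frequent subword units
```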

4. Long sentences: Encoding and generating long sentences is still a wide-open problem. MT systems degrade with sentence length, and NMT systems especially so. Attention helps, but the problem is far from “solved”. In many domains, such as legal text, long, complex sentences are quite common.
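One simple way to see the degradation on your own test set is to bucket sentences by source length and score each bucket separately. A sketch with sacrebleu, again with placeholder file names:

```python
# Sketch: how does translation quality vary with sentence length?
# Score length buckets separately with sacrebleu. File names are placeholders.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

sources = read_lines("test.src")
hypotheses = read_lines("test.hyp")
references = read_lines("test.ref")

buckets = {"1-20": [], "21-40": [], "41+": []}
for src, hyp, ref in zip(sources, hypotheses, references):
    n = len(src.split())
    key = "1-20" if n <= 20 else "21-40" if n <= 40 else "41+"
    buckets[key].append((hyp, ref))

for key, pairs in buckets.items():
    if not pairs:
        continue
    hyps, refs = zip(*pairs)
    score = sacrebleu.corpus_bleu(list(hyps), [list(refs)]).score
    print(f"{key:>5} tokens: BLEU {score:.1f} over {len(pairs)} sentences")
```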

5. Attention != Alignments: This is a subtle but important issue. In traditional SMT systems, like phrase-based MT, alignments provide useful debugging information for inspecting the model. But the attention mechanism cannot be read as alignment in the traditional sense, even though papers have often sold attention as “soft alignment”. In an NMT system, a verb on the target side could be attending to the subject and the object in addition to the verb on the source side.
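A hand-constructed toy example makes the point. The attention matrix below is made up for illustration (it does not come from a trained model), but the pattern it shows, with a target verb spreading its attention over the source subject, verb, and object, is exactly why reading off the argmax as an alignment is misleading.

```python
# Toy, hand-constructed attention matrix (not from a trained model) showing
# why attention weights cannot simply be read off as word alignments.
import numpy as np

src = ["the", "dog", "chased", "the", "cat"]
tgt = ["der", "hund", "jagte", "die", "katze"]

# Rows: target positions, columns: source positions. The row for the target
# verb "jagte" spreads its mass over subject, verb, and object in the source.
attention = np.array([
    [0.70, 0.10, 0.05, 0.10, 0.05],  # der
    [0.10, 0.75, 0.05, 0.05, 0.05],  # hund
    [0.10, 0.35, 0.30, 0.05, 0.20],  # jagte  <- not concentrated on "chased"
    [0.05, 0.05, 0.05, 0.70, 0.15],  # die
    [0.05, 0.05, 0.05, 0.15, 0.70],  # katze
])

for i, t in enumerate(tgt):
    j = int(attention[i].argmax())
    print(f"{t:>6} -> argmax attention on '{src[j]}' ({attention[i, j]:.2f})")
# The naive argmax "alignment" for "jagte" points at "dog", not "chased",
# even though the translation itself is perfectly fine.
```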

6. Difficult to control quality: Every word has multiple possible translations, and a typical MT system scores over a lattice of possible translations of a source sentence. To keep this lattice reasonably sized, we use beam search. By varying the beam width, it is possible to find low-probability but correct translations. For NMT systems, tuning the beam size seems to have little effect, or even an adverse one.
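To make the mechanics concrete, here is a minimal beam-search sketch over a toy scoring function. The step function is a stand-in for a real decoder’s log-probabilities, rigged so that stopping early is never hopeless; under that assumption, the wider beams surface a higher-scoring but degenerate (empty) hypothesis, a toy version of the short-translation bias that larger beams tend to expose in NMT.

```python
# Minimal beam-search sketch over a toy next-token scoring function.
# step_logprobs is a stand-in for a real decoder; it is rigged so that
# the end-of-sentence token is never hopeless, which is enough to show
# how larger beams can surface high-scoring but degenerate hypotheses.
VOCAB = ["the", "cat", "sat", "mat", "</s>"]

def step_logprobs(prefix):
    """Toy decoder: regular tokens equally likely; </s> cheap once length >= 3."""
    scores = {w: -2.0 for w in VOCAB}
    scores["</s>"] = -1.0 if len(prefix) >= 3 else -5.0
    return scores

def beam_search(beam_size, max_len=6):
    beams = [([], 0.0)]                      # (token sequence, log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == "</s>":
                completed.append((seq, score))   # finished hypothesis
            else:
                beams.append((seq, score))       # keep expanding
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

for k in (1, 5, 50):
    seq, score = beam_search(k)
    print(f"beam={k:>2}: {' '.join(seq)}  (log-prob {score:.1f})")
# beam=1 returns a short "sentence"; the wider beams find the even
# higher-scoring hypothesis that stops immediately.
```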

NMT systems are still hard to beat when you have a ton of data, and they are here to stay. There is also something to be said about the black-box nature of neural network models in general, and today’s NMT models (both LSTM- and Transformer-based) suffer from that. This is an active area of research, and I look forward to attending the EMNLP workshop on that topic if my calendar cooperates.