When it comes to machine translation research, few names are as prominent or respected as Philipp Koehn. With over 100 publications to his name, Koehn is an iconic leader in the field, renowned for his groundbreaking research and innovative contributions to the development of machine translation systems. Currently a professor of computer science at Johns Hopkins University and the University of Edinburgh, Koehn continues to push the boundaries of machine translation through his work with the Centre for Language and Speech Processing and the Statistical Machine Translation Group. But Koehn’s influence extends far beyond academia; as the Chief Scientist at Omniscien Technologies, he’s also at the forefront of the commercial deployment of machine translation systems. Under Koehn’s leadership, the open-source Moses system has become the gold standard for machine translation research and deployment. In this post, we’ll explore Koehn’s contributions to the field of machine translation, and look at some of the exciting research projects he’s led over the years. We’ll also delve into the latest trends and innovations in machine translation technology, and what the future holds for this rapidly evolving field.
Philipp, please explain machine translation in simple terms…
To understand documents in different languages you need to have a translation. That’s a process that humans have been doing forever, and it’s always been one of the Holy Grails of natural language processing. Over the last twenty to thirty years machine translation has gone from the point of being really poor, to really quite useful.
Why have you dedicated your life to studying machine translation?
When I started studying computer science in the early nineties, I realised relatively quickly that it’s bad to work on machine learning for machine learning’s sake, without having a problem. At that time, text processing was a really good practical problem because you have the data and you can actually do machine learning, and it’s a somewhat feasible task where there is at least some idea about what the correct input and the correct output should be.
Approaches to machine translation tend to centre around the two concepts of adequacy and fluency. Can you talk us through that?
There are always two goals for translation. Firstly, in terms of fluency, translators want to produce texts that aren’t noticeably translated. Adequacy, on the other hand, is about whether the translation preserves the meaning of the original text. If you think about translation of literature, fluency is more important than adequacy; the book should be enjoyable rather than have all the facts. For example, if I wrote a story for an American newspaper that a town ‘has the population size of Nebraska’, this is understandable in America, but not for someone in China reading a Chinese translation.
We have a model that asks: ‘is this a fluent sentence?’ and a second one to see how well things map. This is then balanced and applied to the type of text being worked on.
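As a rough illustration of the two-model balance described above, the sketch below scores a candidate translation with a toy fluency model and a toy word-mapping model, then combines them with adjustable weights. All function names, probability tables and weights here are hypothetical and chosen only for illustration; they are not taken from any system mentioned in the interview.

```python
import math

def fluency_logprob(target_words, bigram_probs, floor=1e-4):
    """Toy fluency model: sum of log P(word | previous word)."""
    total = 0.0
    prev = "<s>"
    for word in target_words:
        total += math.log(bigram_probs.get((prev, word), floor))
        prev = word
    return total

def adequacy_logprob(source_words, target_words, lexicon, floor=1e-4):
    """Toy mapping model: each target word is explained by its best source word."""
    total = 0.0
    for t in target_words:
        best = max(lexicon.get((s, t), floor) for s in source_words)
        total += math.log(best)
    return total

def combined_score(source_words, target_words, bigram_probs, lexicon,
                   w_fluency=1.0, w_adequacy=1.0):
    """Weighted sum of the two log scores; the weights set the balance."""
    return (w_fluency * fluency_logprob(target_words, bigram_probs)
            + w_adequacy * adequacy_logprob(source_words, target_words, lexicon))

# Hypothetical toy tables for a German-to-English example.
bigram_probs = {("<s>", "the"): 0.5, ("the", "house"): 0.4}
lexicon = {("das", "the"): 0.9, ("haus", "house"): 0.9}
source = ["das", "haus"]
good = ["the", "house"]
bad = ["house", "the"]

score_good = combined_score(source, good, bigram_probs, lexicon)
score_bad = combined_score(source, bad, bigram_probs, lexicon)
```

Both candidates contain the right words, so the mapping model cannot tell them apart; only the fluency model prefers the correct word order. Raising or lowering `w_fluency` relative to `w_adequacy` is one simple way to express the balance for a given type of text.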
How do you quantify the performance of such machine translation systems? What kind of metrics are useful for that?
The model I just discussed has components that make sense within the model, but ultimately the goal is to produce translations. This raises two questions: ‘how do you evaluate?’ and ‘what is a good translation?’ We have an engineering problem in that we want to build machine translation systems and then tune and change them, and be able to measure the results immediately. This means we need an automatic metric to evaluate how good machine translation is.
Are performance metrics more important internally, to train and develop the models?
There’s the infamous BLEU score that is used in machine translation. Ideally you want to know how many words are wrong, but you also have to consider word order, which is not easy to do. The BLEU score looks at how many words are right and also which pairs, triplets and four-word sequences are right. You then compare it against the human translation.
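To make the idea of counting matching words, pairs, triplets and four-word sequences concrete, here is a minimal sentence-level sketch in the spirit of BLEU. It assumes a single reference translation, whitespace tokenisation and no smoothing, so it is a simplification for illustration and not a replacement for a standard implementation.

```python
import math
from collections import Counter

def ngrams(words, n):
    """All contiguous word sequences of length n."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return geo_mean * brevity_penalty

reference = "the cat sat on the mat".split()
shuffled = "mat the on sat cat the".split()
```

Here `simple_bleu(reference, reference)` is a perfect match, while `shuffled` has every word right but no matching word pairs, which is how the metric captures word order without modelling it directly.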
The best human translation is disputed amongst translators. However, if there are flaws in a sentence, two opposed human translations will have different flaws. Why is one flaw worse than another flaw? We have had a reasonably useful set-up for the last twenty years, since the BLEU score was invented. Metrics like this have definitely helped guide the development of machine translation.
One of the earlier ideas of machine translation was to split the problem into three categories; a lexical, a syntactic, and a semantic problem. Is this still a valid approach in the age of neural networks?
Yes, and no. Twenty or thirty years ago, there was a grand vision of machine translation being an application that guides the development of better natural language processing. That involved understanding language by going through various processing stages, such as finding the nouns and the verbs, handling morphology, and detecting syntactic structure. Ultimately, the vision was always to have some meaning representation that is beyond all language. If you take a source language and map it to that meaning representation beyond our language and then generate from that, you can build machine translation systems for every language pair.
The goal of interlingual systems was to build an analyser and a generator for each language. The statistical revolution that happened twenty years ago disregarded this theory and said it’s just a word-mapping problem – a problem based only on finding source words and mapping them to target words. We have to have some kind of model of reordering but it’s all tied to words, so it was a very superficial model. It just looked at word sequences.
These results were generally good, except for the grammatical accuracy, because these systems only looked at very short sequences of five words at a time. For instance, you could get to the end of a sentence that ended without a verb to give it meaning.
We developed quite successful systems that actually built syntactic structures with noun phrases, clauses, and so on. Work was initially done focussing on Chinese to English, where the word order and structure was much trickier. We were also really successful in German to English, which had been a problem for natural language processing. German and English are related languages, but the syntax is vastly different.
We’ve made some progress towards building linguistically better models, but they have become more complicated because you had to build these structures. Then, there was talk about semantic representations and graph structures that are even more difficult. That’s where the neural machine translation wave hit and started out by saying it is just a word mapping problem again: a sequence of input words coming in and a sequence of output words being produced.
With these new machine learning methods, why do you think other fields are now statistically driven? Are neural graphs matching language better? If so, is it due to hardware or increased data?
There are various aspects to that question. The turn to data-driven methods in natural language processing is pretty much parallel to what I just described about machine translation. Other problems in natural language processing, for instance analysing syntactic structure, were often handled with handwritten rules: a sentence is a subject, a verb and an object; a subject is a noun phrase; a noun phrase is a determiner, an adjective and a noun. However, if you look at actual texts, something often violates how language should look. In the nineties, this was rebuilt. Now you just annotate sentences with their syntactic trees.
Why has it completely overtaken the field? Because you can get all your training data for free. All you need is translated text, which is what people use all the time. It’s extremely rare that we actually annotate training data ourselves. We find it from the internet or from public repositories.
Humans also have different ways of learning language. Do we learn language mainly from rules, or do we just absorb it?
I’m not a linguist and we don’t have a linguistic theory. We listen to language first and then go to school, realising we make grammatical mistakes, and then we are taught some rules. Is language driven by rules or just an amalgamation and repetition of what people say? It seems to be a mix of both. There’s some structure, but there’s also evidence of language being recursive. For example, my kid said to me: “he be vibing though”, which is not grammatically correct, but that’s what people say and it becomes part of language through repetition.
There are attempts now to process even sequences of characters, what’s driving this process?
The fundamental problem is that everything is incredibly ambiguous in language. The classic example of this is ‘river bank’; we have banks for our money but a ‘river bank’ means something completely different. That’s why it’s hard for a computer because it has to resolve the ambiguity. How can a computer ever tell the difference between a financial bank or river bank? They’re just banks. They’re just character sequences of four letters – b, a, n, k.
To resolve that, machines can take the context of the preceding word, ‘river’, and apply this to ‘bank’. If you do phrase translation it becomes much less ambiguous. For example, ‘interest rate’ can be accurately translated as a phrase, but it is much more ambiguous when the words are translated individually. It’s somewhat similar when you get to sub-words and character sequences: if you take two words and say they’re different, what will you do with ‘car’ and ‘cars’, for instance? You will still understand what it means to add the plural. We need to get away from representing ‘car’ and ‘cars’ as unrelated words. Looking at the character sequences we see that they’re very similar, and that should help.
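One simple way to see the similarity between ‘car’ and ‘cars’ at the character level is to compare the sets of short character sequences each word contains. The sketch below uses character trigrams and a Jaccard overlap; both choices are assumptions made for illustration and are not the specific method used in any system discussed here.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)

similar = ngram_similarity("car", "cars")   # shares "<ca" and "car"
unrelated = ngram_similarity("car", "bank") # shares nothing
```

A model that treats every word as an opaque symbol sees ‘car’ and ‘cars’ as no more related than ‘car’ and ‘bank’; a character-level view recovers the overlap directly.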
Is this one of the reasons the field is currently relying so heavily on the recurrent neural networks?
You have an input sentence and you have an output sentence. The output sentence, if you predict one word at a time, has all the previous words to help disambiguate. What drives the decision to produce the next word is obviously the input sentence but also all the previous words produced. This creates a recurring process where the big question is: is language recursive or is it just a sequence?
There are good reasons to believe that language is heavily influenced by the latter. When we understand language we always receive it linearly; we read and listen to things word by word. We don’t look at the entire sentence, find the verb and then branch out again. We just see it as a sequence, so it should be modelled as a sequence generating task too, where you produce one word after the other. This makes it a bit more feasible. You still have a fairly large vocabulary of hundreds of thousands of words. You can break up infrequent words into sub-words to make it computationally manageable, but this recurrent process of producing one word at a time suits language well.
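Breaking infrequent words into sub-words can be sketched as follows. This example assumes a small, hand-written sub-word vocabulary and a greedy longest-match rule; in practice the vocabulary is usually learned from data (byte-pair encoding is one widely used approach), so treat the names and the vocabulary here as hypothetical.

```python
def segment(word, subword_vocab):
    """Split a word into known sub-words, longest match first.

    Falls back to single characters when nothing in the vocabulary matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in subword_vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical sub-word vocabulary for illustration only.
subword_vocab = {"translat", "ion", "s", "un", "able"}
rare_word = segment("untranslatable", subword_vocab)
plural = segment("translations", subword_vocab)
```

With this vocabulary, a rare word such as ‘untranslatable’ is represented by pieces the model has seen many times, which keeps the output vocabulary small while still covering unseen words.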
What type of neural network are you using? Input sentences are encoded and output sentences decoded?
You have an input sentence and you try to work out the meaning of that sentence. During rule-based days in particular, this was done explicitly. Representations closely mirrored our understanding of meaning, or at least syntactic structure. In neural networks, there are claims that this kind of meaning emerges in the middle of the process – going from an input sentence to an output sentence. We look at the source sentence and encode that source, then the decoder generates the output sentence.
Recurrent neural networks started in machine translation five years ago, which is now ancient. Since around 2017, we have had a different model that is called a transformer (a more informative name is ‘self-attention’). The idea is that we are modelling words in the context of the other words and we do this very clearly. We refine the representation of each word given the surrounding words, and we go through layers of that, so this is the self-attentional transformer approach. We have the same thing on the source and the target side.
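The idea of refining each word given the surrounding words, layer after layer, can be sketched in a few lines. This is a deliberately stripped-down version of self-attention: it assumes the word vectors serve directly as queries, keys and values, and it leaves out the learned projections, multiple heads, positional information and feed-forward layers that a real transformer has.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Replace each word vector with a weighted average of all word vectors.

    The weights come from scaled dot-product similarity between words.
    """
    dim = len(vectors[0])
    refined = []
    for query in vectors:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        refined.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return refined

def stack(vectors, layers=2):
    """Apply the same refinement step several times, one layer after another."""
    for _ in range(layers):
        vectors = self_attention(vectors)
    return vectors

# Hypothetical two-dimensional word vectors for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined_once = self_attention(sentence)
refined_twice = stack(sentence, layers=2)
```

Each output vector is a mixture of every input vector, so after one layer every word representation already carries information about its context; stacking layers repeats that refinement.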
What can typically go wrong with machine translation and which methods are used to validate the outputs?
Output validation asks, ‘how well did your system do?’ You set aside a number of sentences, translate them and check how well they match a human translation. We can measure the probability the model assigns to each word of the human translation, and we can also make it produce the human output to see how well that scores.
So, what can go wrong? An interesting thing about the neural machine translation approach is that it differs in the types of errors. Statistical methods (because they had a very narrow window in what they looked at) often produced incoherent output. In the neural model, if the input has unusual words the output will be gibberish. With little training data, they often produce beautiful sentences that sound like biblical prophecies and have nothing to do with the input.
Why is this? If you don’t have much data, you have the Bible and Koran in hundreds of languages on which to train your model. If you then want to translate tweets, the model is thinking, ‘I have no idea what that input is but here’s a beautiful sentence I’ve seen in training. How about that?’. It’s effectively a type of hallucination. That’s a real problem because this output can fool you. How do you know, for instance, if a Chinese document translated into English is ‘beautiful gibberish’ or an accurate translation? That’s the fluency-related problem.
There is also a problem of adequacy – do we translate the words correctly and how do we handle ambiguous words? Previously, you just got gibberish output and didn’t trust it. Now you get a beautiful output and there’s no reason not to trust it.
What datasets are you using to train your machine translation?
Everything we can get our hands on. A long time ago I came across the European Parliament website that had public debates translated into all the official languages, which back then were 13 languages; now they’re 28 languages. You can just download all the webpages. It’s very easy to figure out which blocks of text belong where, so it’s not that hard to then break these down into sentences.
It’s a big data resource that was used for a long time. There are about 50 million words of translated text for all the official EU languages, and a lot of the other publicly available datasets are similar. OpenSubtitles is interesting too. Currently, people like to translate. I think this also comes from pirating TV shows, from English to Chinese, for instance. People create subtitles and then translate the subtitles, so there are actually hundreds of millions of words from translated subtitles (although the quality is not always great).
We have a big project right now, running for three or four years, with the University of Edinburgh and other groups in Spain, where we go out on the web and find any website there is. This is something Google has been doing since the very beginning when they got engaged in machine translation. They had the advantage that they had already downloaded the entire internet because of their search engine. They do better because they have more data, but we do have access to the internet and can download everything, and we’ve tried finding translated texts in it.
We usually find goldmines of good data, where there is consistent translation that’s nicely formatted. For the biggest languages, you can get billions of words – more than you can read in your lifetime, but for the lesser known languages there is not much at all.
You mentioned using data from the European Parliament. Is that language not particularly specific, and would it not have an impact on the translation?
It depends on what the application is. We’ve been using machine translation in the academic community for the last 15 years. We used news stories as a test because it’s tough and has a very broad domain. I can talk about sports, natural disasters or political events and it’s relatively complex language – the average sentence length is around 30 words. We found the European Parliament proceedings very useful because they talk about the same subjects, using particular language.
Translation of speech and spoken language is very different from written language, even parliamentary proceedings, which are normally spoken. However, there is a mismatch between spoken language data to train the machines and edited official publications. The two are very different and it’s a real problem.
What role does infrastructure and technology play for machine translation?
It’s been pretty compute-heavy because of large data sizes, with gigabytes of training data. Still, a student at a university could do meaningful research with publicly available data. There was a lot of open source code – you could download the software, run it, and then work on improvements. A single computer was enough for that.
This changed because of neural networks. You need GPU servers, and the more compute resources you have, the better the results. You can build more complex models, and you can measure the complexity of neural networks by how many layers they have. You can build models with six or seven layers or you can have models with twenty layers, except they’re slower to train. For the big language pairs where you do have a billion word corpus, training takes weeks.
The problem we have in academia is competing with industry labs that easily have 1,000 GPUs, with each one costing up to $2,000. You need to put them in a computer, so a computer with four GPUs costs $10,000. In academic institutions like ours, we have about 100 or 200 GPUs available for 50 PhD students (and that’s with us being the Centre for Language and Speech Processing).
Academia simply can’t compete with industry lab experimentation. I read about a language model that was trained on 1,000 GPUs for a week, and I thought, ‘that’s the end for us.’ There’s also a model called GPT-3, which some of your readers may be familiar with. That’s a big language model utilising around 50,000 GPUs in a day with opportunity for more in future – and that’s just inconceivable, so this does limit what we can practically do with our models.
How does academic research translate into industry applications?
Ultimately, students and researchers work on what’s fun, but we are guided by big funding projects, too. In the US, DARPA have been funding machine translation and they’re interested in understanding foreign language text. More recently, languages not covered by Google have been the focus. I’ve worked on Somalian and Ethiopian languages for example, which drives some research.
Generally, in machine translation research, academia is not that concerned about the end applications. The bar isn’t as high as Google Translate. Mistakes are okay, so long as they’re understandable. Facebook has a similar problem in that people post in different languages and the translation needs to be understandable, which is even tougher when people write slang or error-ridden sentences. This can be very hard to translate.
Commercial application for translation is actually completely different. Most companies that want to globalise their products have to translate marketing materials, for instance. At Omniscien Technologies, this is one of the big areas being worked on. That includes subtitle translation for movies and TV. That quality bar is much higher, because consumers expect to read it without any errors, otherwise it will just annoy them. This is where human translation is still relevant.
How far do you think machine translation is away from passing a Turing test?
I’m not going to make predictions for when we will have flawless translation. Machine translation has a history of overselling and under-delivering and going through various hype cycles. I think a good measure of machine translation should be: is it good enough for a particular purpose? If I go to a French newspaper website, for instance, and there’s a story about President Macron, I can run it through Google Translate and I can perfectly understand the story, even if some things are missing. That’s good enough. If I want to buy a Metro ticket in Paris and the translation of the website allows me to buy it, it’s good enough. It doesn’t have to be perfect.
Another measure is, does machine translation make professional human translators more productive? If you can make them twice as fast, that saves an enormous amount of money – and that’s generally the measuring stick. If you translate perfectly, you could construct any kind of intelligence test as a translation challenge and basically write a story in any way, and you can check it. As an example, ‘cousins’ in English is not gendered, but if I translate that into German, I have to pick a gender for that, so you have to alter the meaning a bit when translating. This kind of world knowledge is a deep AI knowledge that we don’t have right now.
We’re not close to perfect translation and we don’t have to be. I think it’s an impossible task anyway. There’s always going to be someone that says, “no, that’s not right.”
You mentioned the black box field of the neural networks you’re training, do you see evidence that language might be an emerging property of a complex system?
I think it’s a very interesting question. What does it actually say, for instance, about image recognition or language? There is a lot of physics envy to reduce the world to a few formulae but that doesn’t seem to work for language, where you have a few rules and that’s it. We can discover principles that are true 90% of the time. The rule is proven by its contradiction. There’s always an exception that proves the rule and language is a lot like this.
I think this is definitely one of the great challenges, trying to understand what’s going on.
To learn more about machine translation, Philipp’s most recent book, Neural Machine Translation, is a great place to start. Click here for the link: https://www.amazon.co.uk/Neural-Machine-Translation-Philipp-Koehn/dp/1108497322