The Technology Behind Machine Translation | Understanding with Unbabel

Remember the good old days when you were a toddler arranging syntactic trees like verbal bonsais? The hours spent analyzing collocated parts of speech? Of course you don't. Your first language just came to you, and then kept on coming to you. From infancy through adulthood, we acquire language as we rack up new experiences that help us put things in context and make sense of verbal ambiguity. It's how we know that the sentence "Yesterday, I went to the bank and ran into my friend" describes a chance encounter between friends, and not a senseless rampage.

But if you're a computer trying to translate that sentence into another language, that distinction is anything but obvious. Yet, as of now, if you pop that same example into a standard online translation engine and translate it into French, out comes a semantically accurate rendering. So, how does a computer
understand the original sentence without that big ol' linguistically dexterous brain of yours? For accurate machine translation and natural language processing, we have word embeddings to thank.

A word embedding is a sequence of numbers that stores information about a word's meaning. Engineers can assign numerical values representing different properties of a word, and from there they can analyze the similarity between two or more words by comparing those numbers. If you plot those numbers as coordinates, you can quickly visualize the relationship between different terms: we can determine the similarity between words by analyzing the angle between their vectors. Of course, this is just
a hypothetical example; in practice, engineers develop word embeddings like these on a massive scale. An ordinary word embedding might encode information about hundreds of properties, and machines need access to millions of words to truly learn the embeddings. And in fact, they're not encoding tangible properties like "furry" or "dangerous," since an algorithm can't really know a word's meaning. Instead, if you train a language model on a sufficient number and variety of examples, the final product will be a system that can predict the neighboring terms for any given word. That way, when a well-trained machine translation engine runs into the word "friend," it'll know not to translate the phrasal verb "ran into" too literally.

The British linguist John Rupert Firth described this approach to
language processing memorably: "You shall know a word by the company it keeps." Both humans and machines know a word by its verbal entourage, but we still have the upper hand. Despite all the progress in word embeddings, scientists still haven't put the language problem to rest. This isn't the end of the quest for machines that truly understand human language; it's just the beginning.
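The angle comparison described above is usually computed as cosine similarity. Here is a minimal sketch with made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and these toy values are purely illustrative:

```python
import math

# Toy 3-dimensional "embeddings" (invented for illustration; real
# embeddings are learned from data and are much longer).
embeddings = {
    "friend": [0.90, 0.80, 0.10],
    "pal":    [0.85, 0.75, 0.20],
    "bank":   [0.10, 0.20, 0.90],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1 mean the
    vectors point the same way (similar words); near 0, unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embeddings["friend"], embeddings["pal"]))   # high
print(cosine_similarity(embeddings["friend"], embeddings["bank"]))  # lower
```

With these toy numbers, "friend" and "pal" point in nearly the same direction, while "friend" and "bank" sit at a much wider angle.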

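Firth's "company it keeps" idea can be sketched with simple co-occurrence counts. Neural embedding models learn this statistically from millions of sentences; the three-sentence corpus and the two-word window below are stand-in assumptions, just to show how neighboring terms characterize a word:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus. Real systems train on millions of sentences;
# these counts only sketch the idea that a word is characterized by
# the company it keeps.
corpus = [
    "i went to the bank and ran into my friend",
    "my friend met a friend at the bank",
    "the river bank was muddy after the rain",
]

# Count which words appear within a +/-2 word window of each word.
window = 2
neighbors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                neighbors[w][words[j]] += 1

# The most common company kept by "friend" in this toy corpus:
print(neighbors["friend"].most_common(3))
```

Even at this scale, the counts separate the words: "friend" keeps company with "my" and "met", while "bank" shows up next to both "the" (money sense) and "river" (shoreline sense), which is exactly the ambiguity a translation engine has to resolve.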