Crosslingual document comparison

The Tower of Babel by Bruegel

Introduction

How can you learn a map from a German language to an English language word vectorisation model, to enable cross-lingual document comparison?

The monolingual version of this task has a familiar solution: Given say a German language word vectoriser and a recipe for building document vectors from word vectors, one can measure the thematic similarity of two German documents by comparing their vectorisations, for example using cosine similarity. Let’s refine the above question in this more concrete context.

Question: Suppose now we have a second such model in English, together with a list of bilingual phrase pairs. How can we compare documents across the two languages?

Recently, I implemented an answer to this question described in the 2013 paper “Exploiting Similarities among Languages for Machine Translation” of Mikolov, Le and Sutskever. The main idea is simple: Learn a linear map that approximately sends each German phrase vector to the vectorisation of the English counterpart recorded in the list of bilingual phrase pairs. Applying to any German document will then position it near similarly-themed English documents.

Data & software acknowledgments

Bilingual word and phrase pairs were taken both from the dict.cc German-English dump, and from the (German-English) Europarl corpus, which is a bilingual transcription of the proceedings of the European Parliament, compiled by Koehn (see his paper “Europarl: A Parallel Corpus for Statistical Machine Translation”). The data from both these sources was used for research purposes only, and the code used to clean dict.cc prior to vectorisation is included below. The English and German word vectorisers were trained on English and German Wikipedia dumps, using the Facebook’s open source FastText software.

The learning task

To formalise the learning task, suppose we are given the English and the German word vectorisers and , a list of bilingual phrase pairs  together with a recipe for building document vectors as linear combinations of their constituent word vectors (for example, define the document vector to be the sum of the word vectors). The same recipe must be used for both languages.

Mikolov, Le and Sutskever’s idea is to compute a linear regression from the German language vector space to English language vector space, mapping the German phrase vectors onto their English counterparts as nearly as possible. Mathematically, the task is to learn a matrix minimising the sum , where and . Hence, a well-learned will have the property that e.g. and are nearby with respect to the Euclidean metric. Applying the linear map to the vectorisation of a German document will thus reposition close to vectorisations of similarly-themed English documents, since document vectors are linear combinations of the constituent word vectors.

Training workflow

To learn the matrix , the workflow was as follows:

  1. Train word vectorisation models on German and on English Wikipedia.

  2. Download and clean the text file version of the online dictionary dict.cc.

  3. Vectorise a subset of the bilingual phrase pairs from the second step using the models from the first step, first removing all pairs in which one of the phrases contained at least one word not in the vocabulary of the corresponding model. These vector pairs will form our training set. We will sample a test set from the remaining vectors.

Learn by applying a linear regression mapping the German phrase vectors to the English phrase vectors.

We explain each step in more detail.

1. Creating the vectorisers

We trained two word vectorisers, one on German Wikipedia and one on English Wikipedia. Each Wikipedia corpus contained around 463k long articles, and each comprised around 3GB of data. Both models were 100-dimensional, trained using FastText in CBOW mode, with negative sampling. The other hyperparameters were:

2. Obtaining the bilingual phrase pairs

We constructed a list of bilingual (German, English) phrase pairs by cleaning a downloadable text file version of the online dictionary dict.cc. The resulting list of pairs looked like:

I used the following python code to clean the raw file.

3. Processing and vectorising the phrase pairs

We filtered the list phrase pairs of (German, English) word pairs described above, throwing out all pairs in which one phrase contained at least one word not in the vocabulary of the corresponding model. After applying the German vectoriser to the German phrases and the English vectoriser to the English phrases, the result was a pair (gdf, edf) of python dataframes, each of shape (644320, 100). The row of gdf (resp., edf) contained the vectorisation of the phrase in the German (resp, English) vocabulary. From this pair of dataframes, we selected 600,000 vector pairs lines on which to train the linear regression. The remainder were held out, to be used for evaluating  .

4. Computing

with linear regression

Using numpy, we computed a linear regression mapping the 600k training lines of gdf to edf. Here is the code:

Note: This function returns a matrix such that , whereas the objective function defined above implies . But implies , so the function solves the transpose of the required problem, which is equivalent.

Evaluation

How can we evaluate the quality of the matrix ? By definition, if has been well-learned, the image of any German document vector will live in a neighbourhood of similarly-themed English documents (hence the same is true of words, since words are short documents). We check if this is so both by eye — inspecting neighbours of the anglicisations of some German word and document vectors — and by computing a metric on the test set of bilingual vectorisations.

Inspection by eye

Below are the ten nearest neighbours to three German words:”Himmel” (Engl. “sky” or “heaven”), “Katze” (Engl. “cat”) and “Bergbau” (Engl. “mining”):

  1. Word = “Himmel”:

2. Word = “Katze”:

3. Word = “Bergbau”:

We now inspect the neighbours to each of two query documents. Each of the ten entries in both outputs lists the first 200 characters of the article — lowercased and stripped of punctuation, and ignoring word boundaries — and is preceded by the cosine similarity. All German language entries are coloured orange, and followed by English translations from Google translate. In both cases the closest match is the article itself, as one might expect.

The first query document is the English Wikipedia article about the Dutch national holiday ‘Koningsdag’, celebrating the King’s birthday:

0: (1.000000) koningsdag koningsdag or king day is national holiday in the kingdom of the netherlands celebrated on april the if the falls on sunday the date marks the birth of king willem alexander from to the day

1: (0.962013)  koningsdag koningsdag bzw ist nationalfeiertag in den niederlanden inklusive der besonderen gemeinden in der karibik in curaçao in sint maarten und in aruba an diesem tag feiern die niederländer den

Koningsdag koningsdag or national holiday in the netherlands including the special parishes in the carribean curaçao in sint maarten and in aruba on this day celebrate the niederländer

2: (0.948780)  goldenes thronjubiläum von elisabeth ii das goldene thronjubiläum von elisabeth ii engl golden jubilee war im jahr das jubiläum der thronbesteigung der britischen königin elisabeth ii mit den interna

Golden thronjubiläum of elisabeth ii the golden thronjubiläum of elisabeth ii engl golden jubilee was in the year the jubilee of the throne of the british queen elisabeth ii with the

3: (0.947533)  silbernes thronjubiläum von elisabeth ii das silberne thronjubiläum von elisabeth ii engl silver jubilee war im jahr das jubiläum der thronbesteigung der britischen königin elisabeth ii aus diesem an

Silver thronjubiläum of elisabeth ii the silver thronjubiläum of elisabeth ii engl silver jubilee was the year of the jubilee of the throne of the british queen elisabeth ii from this

4: (0.942364) silver jubilee of elizabeth ii the silver jubilee of elizabeth ii marked the anniversary of queen elizabeth ii accession to the thrones of the united kingdom canada australia new zealand and other com

5: (0.941474)  beatrix niederlande beatrix januar in baarn als beatrix wilhelmina armgard prinzessin von oranien nassau prinzessin zur lippe biesterfeld war vom april bis zum april als sie das amt ihrem sohn willem

Beatrix netherlands beatrix january in baarn as beatrix wilhelmina armgard princess of oranien nassau princess to lippe biesterfeld was from april until april when she was the office her son willem

6: (0.940599)  diamantenes thronjubiläum von elisabeth ii das diamantene thronjubiläum von elisabeth ii engl diamond jubilee war im jahr das jubiläum der thronbesteigung der britischen königin elisabeth ii es fande

Diamond thronjubiläum of elisabeth ii the diamond thronjubiläum of elisabeth ii engl diamond jubilee was the year of the jubilee of the throne of the british queen elisabeth ii it found

7: (0.939123)  juliana niederlande juliana louise emma marie wilhelmina deutsch auch juliane april in den haag märz in soestdijk prinzessin von oranien nassau herzogin zu mecklenburg war vom september bis zum april

Juliana netherlands juliana louise emma marie wilhelmina german also juliane april in the hague march in soestdijk princess of oranien nassau duchess zu mecklenburg was from september to april

8: (0.938922) father day father day is celebration honoring fathers and celebrating fatherhood paternal bonds and the influence of fathers in society many countries celebrate it on the third sunday of june but it i

9: (0.938111) juliana of the netherlands juliana juliana louise emma marie wilhelmina april march was the queen regnant of the kingdom of the netherlands between and she was the only child of queen wilhelmina and p

The second query document is an English Wikipedia article on “roadkill cuisine”:

0: (1.000000) roadkill cuisine roadkill cuisine is preparing and eating roadkill animals hit by vehicles and found along roads it is practice engaged in by small subculture in the united states southern canada the

1: (0.937038) dog meat dog meat refers to the flesh and other edible parts derived from dogs human consumption of dog meat has been recorded in many parts of the world including ancient china ancient mexico and anc

2: (0.935527)  heimtier heimtiere sind tiere die vom menschen aus verschiedenen motiven meist in der wohnung oder in sonstigem engen kontakt mit ihm gehalten werden motive für die haltung von heimtieren können sein

Pet animals are animals that are kept by people from different motives usually in the home or in other close contact with him motives for the attitude of animals can be

3: (0.934084)  hundefleisch hundefleisch wird in einigen ländern als nahrung genutzt zum beispiel in korea vietnam und einigen südlichen und nördlichen regionen chinas wie guangdong und beijing allerdings hat sich

Dog meat is used in some countries as a food for example in korea vietnam and some southern and northern regions china like guangdong and beijing however has itself

4: (0.932840)  kängurufleisch als kängurufleisch in australien roo wird das fleisch aller australischen känguruarten bezeichnet es ist nicht nur in australien erhältlich sondern wird vor allem ins ausland exportier

Kangaroo as kangaroo in australia roo is the meat of all australian kangaroo species, it is not only available in australia but is mainly exported to abroad

5: (0.932549) tree squirrel tree squirrels include over hundred species that are found on all continents except antarctica and are the members of the squirrel family sciuridae most commonly referred to as squirrels

6: (0.929929)  hoden lebensmittel hoden verschiedener tierarten zählen zu den innereien und sind generell essbar in mitteleuropa ist ihre verwendung in der küche heute unüblich war früher jedoch relativ verbreitet

Foodstuffs of various animals are among the innards and are generally edible in Central Europe, their use in the kitchen is unusual today but was relatively common

7: (0.928278)  katzenfleisch katzenfleisch dient in jeweils mehr oder weniger geringem umfang unter anderem in südchina nordvietnam korea peru und großbritannien zu nahrungszwecken in notzeiten wurde auch katzenfle

Cat meat is mainly used in south china north korea korea peru and britain for food purposes in emergency times also katzenfle