The Problem With Word Embeddings
Refreshing Word2Vec embeddings is a pain. Say you’d like to change the dimensionality of your embeddings or switch to a better-trained model: you’d need to recompute almost everything built on top of the old embeddings, because the new ones live in a completely different space. This applies to any downstream modeling, unsupervised learning, or visualization.
The Solution
It turns out you may not need to recalculate everything. Word2Vec embeddings can be translated from one space to another, as long as the relationships between words stay the same.
I trained two different w2v models on the IMDB reviews. The vocabulary and corpus stayed the same, yet the embeddings for individual words came out totally different:
```
model2 - "one": -5.90761974e-02, -3.17945816e-02, 1.26407698e-01
```
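For reference, here is roughly how such a pair of models could be produced. A minimal sketch assuming gensim 4 and a list of tokenized IMDB reviews in `sentences` (a hypothetical variable, not shown here):

```
from gensim.models import Word2Vec

# Same corpus, but different seeds and dimensions: each training run
# settles into its own arbitrary embedding space.
model1 = Word2Vec(sentences, vector_size=500, seed=1, min_count=5)  # original, 500-dim
model2 = Word2Vec(sentences, vector_size=200, seed=2, min_count=5)  # new, 200-dim

print(model1.wv["one"][:3])
print(model2.wv["one"][:3])  # completely different values for the same word
```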
The fix is to learn a translation matrix that maps the original space onto the new one:

```
[Original embeddings] dot [Translation matrix] = [New embeddings]
[2000 x 500]          dot [500 x 200]          = [2000 x 200]
```
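Such a matrix can be fit with ordinary least squares over a set of anchor words that exist in both spaces. A minimal sketch with NumPy, assuming the `model1` and `model2` from above and gensim 4’s `index_to_key`:

```
import numpy as np

# Use the 2000 most frequent words as anchors (index_to_key is
# frequency-ordered in gensim 4).
words = model1.wv.index_to_key[:2000]

X = np.array([model1.wv[w] for w in words])  # [2000 x 500] original embeddings
Y = np.array([model2.wv[w] for w in words])  # [2000 x 200] new embeddings

# Solve X @ W ≈ Y in the least-squares sense; W is the [500 x 200]
# translation matrix.
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Translate any original embedding into the new space.
translated = model1.wv["one"] @ W  # 200-dim vector approximating model2's "one"
```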
Evaluating the Translation
To evaluate the quality of the translation, we can hold out words that weren’t used to fit the translation matrix. I recommend measuring how often a held-out word’s translated embedding lands on that same word in the new space.
For instance, if the translated “one” has “one”, “a”, and “an” as its nearest neighbors in the new space, it is a hit. However, if it lands near “time”, “star”, and “tree”, it is a miss.
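One way to compute that hit rate, sketched here with gensim’s `similar_by_vector` and the fitted `W` from above (`heldout_words` is a hypothetical list of words excluded from the fit):

```
def hit_rate(model1, model2, W, heldout_words, topn=3):
    """Fraction of held-out words whose translated vector lands among
    its own top-n nearest neighbors in the new space."""
    hits = 0
    for word in heldout_words:
        translated = model1.wv[word] @ W
        neighbors = [w for w, _ in model2.wv.similar_by_vector(translated, topn=topn)]
        if word in neighbors:
            hits += 1
    return hits / len(heldout_words)
```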
Applications
The method can be applied in many ways:
- Introducing new vocabulary to your model (see the sketch after this list)
- Translating between languages (e.g. English - French, Android - iOS)
- Increasing or decreasing the dimensionality of your embeddings
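For the first application, one possible approach is to retrain on a corpus that contains the new words, fit the translation on the shared vocabulary, and map only the new words back into the old space. A hypothetical sketch (`model_old` and `model_new` are assumed names, with the NumPy setup from above):

```
# Fit W on words that exist in both models.
shared = [w for w in model_new.wv.index_to_key if w in model_old.wv]
X = np.array([model_new.wv[w] for w in shared])
Y = np.array([model_old.wv[w] for w in shared])
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Map words that only the new model knows into the old space, so
# existing downstream models can consume them unchanged.
new_words = [w for w in model_new.wv.index_to_key if w not in model_old.wv]
new_vectors = {w: model_new.wv[w] @ W for w in new_words}
```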