A comparison of sentence embedding techniques by Prerna Kashyap, our RARE Incubator student. As her graduation project, Prerna implemented Sent2Vec, a new document embedding model, in Gensim and compared it to existing models like Doc2Vec and FastText.

What are sentence embeddings?

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. Word embeddings are representations of words in an N-dimensional vector space in which semantically similar words (e.g. “king” and “monarch”) or semantically related words (e.g. “bird” and “fly”) lie close together, depending on the training method (using words as context or using documents as context). For texts, one of the most common fixed-length features is the bag-of-words. But this method discards a lot of information, such as the ordering and semantics of the words. For example, two sentences can have identical bag-of-words representations but entirely different meanings, like ‘I’m going to study Math instead of English’ and ‘I’m going to study English instead of Math’.
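The loss of word order can be checked directly. The snippet below is a minimal sketch that assumes the simplest possible bag-of-words (lowercased, whitespace-split word counts); the helper name bag_of_words is made up for illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercased, whitespace-separated tokens; word order is discarded."""
    return Counter(sentence.lower().split())


a = "I'm going to study Math instead of English"
b = "I'm going to study English instead of Math"

# Both sentences contain exactly the same words, so their counts are equal.
print(bag_of_words(a) == bag_of_words(b))  # True
```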

What is Sent2Vec?

The paper Unsupervised Learning of Sentence Embeddings using Compositional N-Gram Features introduces a new model for sentence embeddings called Sent2Vec. Sent2Vec presents a simple but efficient unsupervised objective to train distributed representations of sentences. It can be thought of as an extension of FastText and word2vec (CBOW) to sentences. The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the n-grams of words present in each sentence, and averaging the n-gram embeddings along with the word embeddings.
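The averaging step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the embedding table is a hypothetical dictionary with made-up values, only unigrams and word bigrams are used, and training details such as how n-grams are stored or dropped are ignored.

```python
def sentence_embedding(tokens, embeddings, dim):
    """Average the embeddings of all unigrams and word bigrams in a sentence."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    features = list(tokens) + bigrams
    # Features without an embedding (out of vocabulary) are simply skipped.
    vectors = [embeddings[f] for f in features if f in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional source embeddings, for illustration only.
toy_embeddings = {
    "new": [1.0, 0.0],
    "york": [0.0, 1.0],
    "new york": [2.0, 2.0],
}

print(sentence_embedding(["new", "york"], toy_embeddings, dim=2))  # [1.0, 1.0]
```

Note that the bigram "new york" has its own vector, so the sentence embedding is not just the mean of the two word vectors.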

Sent2Vec seemed like an interesting model that could be incorporated in Gensim and so my task for the student incubator was to come up with a correct and efficient native implementation of Sent2Vec in Gensim.

How is Sent2Vec different from FastText?

Sent2Vec predicts target words from source word sequences (n-grams of words), whereas FastText predicts target words from character sequences (n-grams of characters).
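To make the difference concrete, the sketch below extracts both kinds of features. The helper names are made up for illustration, and the "<" and ">" word-boundary markers follow the convention described in the FastText paper; neither function is taken from the Gensim code.

```python
def char_ngrams(word, n):
    """Character n-grams of a single word, FastText style, with boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def word_ngrams(tokens, n):
    """Word n-grams of a tokenized sentence, the kind of feature Sent2Vec adds."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(char_ngrams("where", 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

print(word_ngrams(["how", "tall", "is", "the", "building"], 2))
# ['how tall', 'tall is', 'is the', 'the building']
```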

Native implementation of Sent2Vec in Gensim

I started off by reading the paper and going through the original C++ code open-sourced by the authors, which builds upon Facebook’s FastText. The first version of the code I came up with was a pure Python/NumPy implementation and was consequently pretty slow. So I profiled the code to find the hotspots that took up the most time and implemented those in Cython. Selecting which parts to optimize was easy with profiling: the bulk of the work was done in the nested loop that went through each sentence and, for each word in the sentence, tried to predict all the other words within its window. Cythonising these hotspots gave a significant speedup.

Parallelization of cythonised code

The next step was to make the code run in parallel, to make use of multicore machines. Due to the simplicity of the model, parallel training is straightforward using parallelized or distributed SGD. In standard Python, the Global Interpreter Lock (GIL) enforces single-threaded execution for CPU-bound tasks. But in Sent2Vec’s cythonised version I was already working with low-level C++ calls, so the GIL did not have to stand in the way. Sent2Vec’s multithreaded version is inspired by Gensim’s Word2Vec: it accepts a generic sequence of sentences, groups a fixed number of sentences into a job and puts each job in a queue, from which worker threads repeatedly lift jobs for training. The core computation is done inside _do_train_job_util, which is the function I optimized with Cython and BLAS. The GIL must be released for multithreading to be practical, and this is also done inside _do_train_job_util, using Cython’s nogil syntax.
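The job-queue pattern described above can be sketched with the standard library alone. This is not the Gensim code: train_job is a hypothetical stand-in for _do_train_job_util that only counts words, and because it is pure Python it does not release the GIL, so the sketch shows the structure of the producer and workers and not the speedup.

```python
import queue
import threading


def batches(sentences, batch_size):
    """Group a stream of sentences into jobs of at most batch_size sentences."""
    batch = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def train_in_parallel(sentences, train_job, workers=3, batch_size=2):
    """Feed jobs to worker threads through a bounded queue; return words processed."""
    jobs = queue.Queue(maxsize=2 * workers)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more work for this thread
                break
            processed = train_job(job)
            with lock:
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for job in batches(sentences, batch_size):
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sum(results)


def count_words(job):
    """Hypothetical stand-in for the real training routine."""
    return sum(len(sentence) for sentence in job)


corpus = [["human", "interface"], ["computer"], ["graph", "trees", "minors"]]
print(train_in_parallel(corpus, count_words))  # 6
```

The bounded queue keeps the producer from reading the whole corpus into memory when the workers are slower than the reader.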

Benchmarking and Evaluation

For benchmarking, I used the Toronto Book Corpus, which is a private corpus. It has all sentences in 11,038 books: about 70 million sentences and 0.9 billion words. Only 7,087 of the 11,038 books are unique. Among them, 2,089 books have one duplicate, 733 books have two and 95 books have more than two duplicates. A maximum sentence length of 300 is used, i.e. sentences longer than 300 are ignored. The comparison is done with Gensim’s Doc2Vec (DBOW and DM) and Gensim’s FastText model.

Examples of usage:

    from gensim.models import Sent2Vec
    from gensim.test.utils import common_texts

    model = Sent2Vec(common_texts, size=100, min_count=1)

For online training:

    from gensim.models import Sent2Vec
    from gensim.test.utils import common_texts

    new_sentences = [
        ['computer', 'artificial', 'intelligence'],
        ['artificial', 'trees'],
        ['human', 'intelligence'],
        ['artificial', 'graph'],
        ['intelligence'],
        ['artificial', 'intelligence', 'system'],
    ]

    model = Sent2Vec(common_texts, size=100, min_count=1)
    model.build_vocab(new_sentences, update=True)
    model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)

Here common_texts is a list/stream of sentences, size is the size of the sentence embeddings and min_count refers to the minimum frequency of words. Several other parameters for the model can also be specified like epochs (number of iterations over the training corpus), dropout_k (number of ngrams dropped while training the model), workers (number of worker threads used for training to ensure faster training on multicore machines), etc. For a full list of parameters that Sent2Vec supports take a look at the documentation.

Sentence embeddings can be evaluated on various tasks. I evaluated them on classification of product reviews (CR) (Hu and Liu, 2004), opinion polarity (MPQA) (Wiebe et al., 2005), movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). To run these classification tasks, I used the SentEval evaluation toolkit for sentence embeddings.

The sentence embeddings are inferred from input sentences and directly fed to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the MR, CR, SUBJ, MPQA datasets. For these datasets nested cross-validation is used to tune the L2 penalty. For the TREC dataset, the accuracy is computed on the test set.
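The structure of nested cross-validation can be sketched as follows. This is a standard-library illustration of the general technique and not the evaluation toolkit's code: the fold helper uses contiguous, unshuffled folds, the inner fold count of 5 is an assumption, and fit_and_score is a hypothetical callback that would train a classifier with the given L2 penalty on the training indices and return its accuracy on the held-out indices.

```python
from statistics import mean


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous (train, test) index pairs."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def nested_cv(n_items, penalties, fit_and_score, outer_k=10, inner_k=5):
    """Mean outer-fold accuracy, with the penalty tuned on inner folds only."""
    outer_scores = []
    for train, test in k_fold_indices(n_items, outer_k):

        def inner_score(penalty):
            # The inner folds are carved out of the outer training portion only,
            # so the outer test fold never influences the choice of penalty.
            scores = []
            for inner_train, inner_val in k_fold_indices(len(train), inner_k):
                scores.append(fit_and_score(
                    [train[i] for i in inner_train],
                    [train[i] for i in inner_val],
                    penalty,
                ))
            return mean(scores)

        best_penalty = max(penalties, key=inner_score)
        outer_scores.append(fit_and_score(train, test, best_penalty))
    return mean(outer_scores)


def toy_fit_and_score(train_idx, test_idx, penalty):
    """Hypothetical scorer that ignores the data and simply prefers penalty 1.0."""
    return 1.0 - abs(penalty - 1.0)


print(nested_cv(20, [0.1, 1.0, 10.0], toy_fit_and_score))  # 1.0
```

In a real run, fit_and_score would fit the logistic regression on the sentence embeddings, and the folds would normally be shuffled before splitting.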

Evaluation Results

|S.No.|Model|MR|CR|SUBJ|MPQA|TREC|
|---|---|---|---|---|---|---|
|1.|Gensim Sent2Vec|63.72|73.38|79.45|75.66|58.2|
|2.|Gensim Doc2Vec DM|50.23|62.41|51.09|66.74|22.2|
|3.|Gensim Doc2vec DBOW|51.62|57.4|53.68|67.15|26.2|
|4.|Gensim FastText|68.39|74.12|84.95|81.18|61.4|

Examples where Sent2Vec outperforms Doc2Vec

For example, Doc2Vec performs pretty poorly in the question type classification task (TREC). The TREC dataset contains 6 question classes, namely ENTY (entity), ABBR (abbreviation), LOC (location), HUM (human), NUM (numeric) and DESC (description). Some questions that Sent2Vec classifies correctly and Doc2Vec doesn’t are listed here, with the label Doc2Vec predicted:

How far is it from Denver to Aspen? (actual label: NUM, predicted label: DESC)
Who was Galileo? (actual label: HUM, predicted label: DESC)
What is an atom? (actual label: DESC, predicted label: HUM)
How tall is the Sears building? (actual label: NUM, predicted label: DESC)

The number of correctly predicted labels out of the total number of labels for Doc2Vec and Sent2Vec is as follows: (NOTE: The following numbers have been computed without 10-fold cross-validation, in contrast to the evaluation table above. In this case the TREC test accuracy for Sent2Vec and Doc2Vec is 55.4% and 30.6% respectively.)