
Gensim word2vec WMD similarity dictionary

I'm using word2vec on a dataset of 1 million abstracts (2 billion words). To find the most similar documents, I use the gensim.similarities.WmdSimilarity class. When retrieving the best match with wmd_similarity_index[query], the calculation spends most of its time building a dictionary. Here is a piece of the log:

2017-08-25 09:45:39,441 : INFO : built Dictionary(127 unique tokens: ['empirical', 'model', 'estimating', 'vertical', 'concentration']...) from 2 documents (total 175 corpus positions)                                                        
2017-08-25 09:45:39,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])          

What does this part do? Does it depend on the query? Is there a way to do these calculations once and for all?

EDIT: training and scoring phases in my code:

Training and saving to disk:

w2v_size = 300
# sg=1: use skip-gram; hs=1: hierarchical softmax
word2vec = gensim.models.Word2Vec(texts, size=w2v_size, window=9, min_count=5, workers=1, sg=1, hs=1, iter=20)
word2vec.save(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv)
corpus_w2v_wmd_index.save(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)

Loading and scoring:

w2v = gensim.models.Word2Vec.load(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
words = [t for t in proc_text if t in w2v.wv]
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity.load(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
scores_w2v = np.array(corpus_w2v_wmd_index[words])  
asked Jan 29 '23 by Robin

1 Answer

The "Word Mover's Distance" calculation is relatively expensive – for each pairwise document comparison, it searches for an optimal 'shifting' of semantic positions, and that shifting is itself dependent on the pairwise simple-distances between all words of each compared document.

That is, it involves far more calculation than a simple cosine-distance between two high-dimensional vectors, and it involves more calculation the longer the two documents are.
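For scale, compare the two operations (a rough sketch; vec_a, vec_b and the token lists are placeholders for your own data):

import numpy as np

# Cosine similarity between two document vectors: one dot product and two norms.
cos_sim = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# WMD between two token lists: solves an optimal-transport problem over every
# word pair of the two documents, so its cost grows with both documents' lengths.
wmd = w2v.wv.wmdistance(doc_a_tokens, doc_b_tokens)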

There isn't much that could be pre-calculated, from the texts corpus, until the query's words are known. (Each pairwise calculation depends on the query's words, and their simple-distances to each corpus document's words.)

That said, there are some optimizations the gensim WmdSimilarity class doesn't yet do.

The original WMD paper described a quicker calculation that could help eliminate corpus texts that couldn't possibly be in the top-N most-WMD-similar results. Theoretically, the gensim WmdSimilarity could also implement this optimization, and give quicker results, at least when initializing the WmdSimilarity with the num_best parameter. (Without it, every query returns all WMD-similarity-scores, so this optimization wouldn't help.)
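As a sketch of that pruning idea (not existing gensim functionality; the helper and shortlist size are illustrative), a cheap lower bound such as the Word Centroid Distance from the same paper can shortlist candidates before running full WMD:

import numpy as np

def centroid(tokens, kv):
    # mean of the word vectors for in-vocabulary tokens
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0)

query_centroid = centroid(words, w2v.wv)
wcd = np.array([np.linalg.norm(centroid(doc, w2v.wv) - query_centroid) for doc in texts])

candidates = np.argsort(wcd)[:500]          # keep only the cheapest-looking docs
full_wmd = {i: w2v.wv.wmdistance(words, texts[i]) for i in candidates}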

Also, for now the WmdSimilarity class just calls KeyedVectors.wmdistance(doc1, doc2) for every query-to-corpus-document pair, as raw texts. Thus the pairwise simple-distances from all doc1 words to doc2 words will be recalculated each time, even if many pairs repeat across the corpus. (That is, if 'apple' is in the query and 'orange' is in every corpus doc, it will still calculate the 'apple'-to-'orange' distance repeatedly.)
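In other words, each query currently amounts to something like the loop below (a sketch of the observed behavior, not the actual gensim source; the 1/(1+distance) conversion into a similarity score is my reading of the class):

distances = [w2v.wv.wmdistance(words, doc_tokens) for doc_tokens in texts]
scores = [1.0 / (1.0 + d) for d in distances]   # distances turned into similarity scores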

So, some caching of interim values might help performance. For example, with a query of 1000 words, and a vocabulary of 100,000 words among all corpus documents, the ((1000 * 100,000) / 2) 50 million pairwise word-distances could be precalculated once, using 200MB, then shared by all subsequent WMD-calculations. To add this optimization would require a cooperative refactoring of WmdSimilarity.get_similarities() and KeyedVectors.wmdistance().
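A rough sketch of that caching idea (not existing gensim functionality; the variable names are illustrative):

import numpy as np
from scipy.spatial.distance import cdist

query_vecs = np.vstack([w2v.wv[t] for t in words])
vocab = sorted({t for doc in texts for t in doc if t in w2v.wv})
vocab_vecs = np.vstack([w2v.wv[t] for t in vocab])

# One-off matrix of euclidean distances from every query word to every corpus word;
# each per-document WMD could then look up rows of this table instead of recomputing.
word_dists = cdist(query_vecs, vocab_vecs)      # shape: (len(words), len(vocab))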

Finally, Word2Vec/Doc2Vec applications don't necessarily require or benefit much from stop-word removal or stemming. But because the expense of WMD calculation grows with document and vocabulary size, anything that shrinks effective document sizes could help performance. So various ways of discarding low-value words, or coalescing similar words, may be worth considering when using WMD on large document sets.
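For example, a simple pre-filter like the one below (assuming the same imports and variables as the question's code; the stopword list and length cap are illustrative, not a recommendation for any particular corpus) can shrink each document before building the index:

from gensim.parsing.preprocessing import STOPWORDS

def shrink(doc, max_len=100):
    # drop stopwords and out-of-vocabulary tokens, then cap the document length
    kept = [t for t in doc if t not in STOPWORDS and t in w2v.wv]
    return kept[:max_len]

small_texts = [shrink(doc) for doc in texts]
small_index = gensim.similarities.WmdSimilarity(small_texts, w2v.wv, num_best=10)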

answered Feb 02 '23 by gojomo