I'm using word2vec on a dataset of 1 million abstracts (2 billion words). To find the most similar documents, I use the gensim.similarities.WmdSimilarity class. When trying to retrieve the best match using wmd_similarity_index[query], the calculation spends most of its time building a dictionary. Here is a piece of the log:
2017-08-25 09:45:39,441 : INFO : built Dictionary(127 unique tokens: ['empirical', 'model', 'estimating', 'vertical', 'concentration']...) from 2 documents (total 175 corpus positions)
2017-08-25 09:45:39,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
What does this part do? Does it depend on the query? Is there a way to do these calculations once and for all?
EDIT: training and scoring phases in my code:
Training and saving to disk:
w2v_size = 300
word2vec = gensim.models.Word2Vec(texts, size=w2v_size, window=9, min_count=5, workers=1, sg=1, hs=1, iter=20) # sg=1 means skip gram is used
word2vec.save(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv)
corpus_w2v_wmd_index.save(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
Loading and scoring:
w2v = gensim.models.Word2Vec.load(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
words = [t for t in proc_text if t in w2v.wv]
corpus_w2v_wmd_index = gensim.similarities.docsim.Similarity.load(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
scores_w2v = np.array(corpus_w2v_wmd_index[words])
The "Word Mover's Distance" calculation is relatively expensive – for each pairwise document comparison, it searches for an optimal 'shifting' of semantic positions, and that shifting is itself dependent on the pairwise simple-distances between all words of each compared document.
That is, it involves far more calculation than a simple cosine-distance between two high-dimensional vectors, and it involves more calculation the longer the two documents are.
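To get a feel for the per-pair cost, you can time a single call to the existing KeyedVectors.wmdistance() method. This is only a rough illustration: it assumes the word2vec model from the training snippet above, the first token list is taken from the logged dictionary, and the second is an arbitrary example.

import time

doc1 = ['empirical', 'model', 'estimating', 'vertical', 'concentration']
doc2 = ['vertical', 'concentration', 'profile', 'measurement']

start = time.time()
distance = word2vec.wv.wmdistance(doc1, doc2)  # one pairwise WMD computation
print('WMD = %.4f, computed in %.4f s' % (distance, time.time() - start))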
There isn't much that could be pre-calculated from the texts corpus until the query's words are known. (Each pairwise calculation depends on the query's words, and their simple-distances to each corpus document's words.)
That said, there are some optimizations the gensim WmdSimilarity class doesn't yet do.
The original WMD paper described a quicker calculation that could help eliminate corpus texts that couldn't possibly be in the top-N most-WMD-similar results. Theoretically, the gensim WmdSimilarity could also implement this optimization and give quicker results, at least when initializing the WmdSimilarity with the num_best parameter. (Without it, every query returns all WMD-similarity-scores, so this optimization wouldn't help.)
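For reference, num_best is an existing WmdSimilarity parameter; building the index with it bounds what each query returns. Assuming the texts and word2vec objects from the question's training snippet:

import gensim

corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv, num_best=10)
top_matches = corpus_w2v_wmd_index[words]  # list of (doc_index, similarity) pairs, not all scores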
Also, for now the WmdSimilarity class just calls KeyedVectors.wmdistance(doc1, doc2) for every query-to-corpus-document pair, as raw texts. Thus the pairwise simple-distances from all doc1 words to doc2 words will be recalculated each time, even if many pairs repeat across the corpus. (That is, if 'apple' is in the query and 'orange' is in every corpus doc, it will still calculate the 'apple'-to-'orange' distance repeatedly.)
So, some caching of interim values might help performance. For example, with a query of 1,000 words and a vocabulary of 100,000 words among all corpus documents, the ((1,000 * 100,000) / 2) = 50 million pairwise word-distances could be precalculated once, using about 200 MB, then shared by all subsequent WMD calculations. Adding this optimization would require a cooperative refactoring of WmdSimilarity.get_similarities() and KeyedVectors.wmdistance().
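A rough sketch of that caching idea follows. This is not existing gensim functionality; it only shows the one-time precomputation (assuming the w2v model and proc_text tokens from the question's scoring snippet), and the resulting matrix would still need to be wired into the refactored methods above to be useful.

import numpy as np
from scipy.spatial.distance import cdist

query_words = [t for t in proc_text if t in w2v.wv]
vocab = list(w2v.wv.index2word)   # gensim 3.x attribute; `index_to_key` in gensim 4.x

query_vecs = np.array([w2v.wv[w] for w in query_words])
vocab_vecs = np.array([w2v.wv[w] for w in vocab])

# One (n_query_words x vocab_size) matrix of Euclidean word-distances,
# computed once per query and shareable across every per-document WMD calculation.
dist_matrix = cdist(query_vecs, vocab_vecs)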
Finally, Word2Vec/Doc2Vec applications don't necessarily require or benefit much from stop-word removal or stemming. But because the expense of WMD calculation grows with document and vocabulary size, anything that shrinks effective document sizes could help performance. So various ways of discarding low-value words, or coalescing similar words, may be worth considering when using WMD on large document sets.
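One hedged way to shrink documents before indexing is to drop very rare and very frequent (stopword-like) tokens. The thresholds below are arbitrary illustrations, and texts/word2vec come from the question's training snippet:

import gensim
from collections import Counter

token_counts = Counter(t for doc in texts for t in doc)
total_tokens = float(sum(token_counts.values()))

def trim(doc, min_count=5, max_freq=0.05):
    # Keep tokens that are neither too rare nor too common.
    return [t for t in doc
            if token_counts[t] >= min_count
            and token_counts[t] / total_tokens <= max_freq]

trimmed_texts = [trim(doc) for doc in texts]
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(trimmed_texts, word2vec.wv, num_best=10)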