 

Word2Vec: Using Gensim and Google-News dataset- Very Slow Execution Time

The code is in Python. I loaded the binary model into gensim and used the "init_sims" option to make execution faster. The OS is OS X. It takes almost 50-60 seconds to load the model, and an equivalent time to run "most_similar". Is this normal? Before using the init_sims option, it took almost double the time! I have a feeling it might be an OS RAM-allocation issue.

from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)
model.save('SmallerFile')
# model saved into SmallerFile; next, load from it
model = Word2Vec.load('SmallerFile', mmap='r')
# give result here
print(model.most_similar(positive=['woman', 'king'], negative=['man']))
Nachiappan Chockalingam asked Sep 23 '16

1 Answer

Note that the memory-saving effect of init_sims(replace=True) doesn't persist across save/load cycles, because saving always saves the 'raw' vectors (from which the unit-normalized vectors can be recalculated). So, even after your re-load, when you call most_similar() for the first time, init_sims() will be called behind the scenes, and the memory usage will be doubled.

And, the GoogleNews dataset is quite large, taking 3+ GB to load even before the unit-normalization possibly doubles the memory usage. So depending on what else you've got running and the machine's RAM, you might be using swap memory by the time the most_similar() calculations are running – which is very slow for the calculate-against-every-vector-and-sort-results similarity ops. (Still, any most_similar() checks after the 1st won't need to re-fill the unit-normalized vector cache, so should go faster than the 1st call.)
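To see why each most_similar() call is costly, here's a toy pure-numpy sketch of roughly what happens under the hood (the 5-word vocabulary and vector values are made up for illustration): every vector is unit-normalized once — that's the cache init_sims() fills — then the query is scored against every vector and the scores sorted.

```python
import numpy as np

# Hypothetical stand-in for the model's raw vectors (model.syn0):
# 5 words x 4 dims; the real GoogleNews set is ~3M words x 300 dims.
vocab = ['king', 'queen', 'man', 'woman', 'apple']
syn0 = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1],
    [0.9, 0.1, 0.8, 0.0],
    [0.8, 0.2, 0.1, 0.8],
    [0.0, 0.1, 0.0, 0.9],
], dtype=np.float32)

# init_sims() effectively caches this: each row rescaled to unit length.
syn0norm = syn0 / np.linalg.norm(syn0, axis=1, keepdims=True)

def most_similar(positive, negative, topn=1):
    idx = {w: i for i, w in enumerate(vocab)}
    # Combine the query words into a single unit vector.
    query = (sum(syn0norm[idx[w]] for w in positive)
             - sum(syn0norm[idx[w]] for w in negative))
    query /= np.linalg.norm(query)
    # The expensive part: a dot product against EVERY vector, then a sort.
    sims = syn0norm.dot(query)
    for w in positive + negative:      # exclude the query words themselves
        sims[idx[w]] = -np.inf
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]

print(most_similar(['woman', 'king'], ['man']))
```

With 3 million 300-dimensional vectors that dot-product-and-sort touches gigabytes of memory on every call, which is why swapping makes it crawl.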

Given that you've saved the model after init_sims(replace=True), its raw vectors are already unit-normalized. So you can manually patch the model to skip the recalculation, just after your load():

model.syn0norm = model.syn0

Then even your first most_similar() will just consult the (single, memory-mapped) set of vectors, without triggering an init_sims().

If that's still too slow, you may need more memory or to trim the vectors to a subset. The GoogleNews vectors seem to be sorted to put the most-frequent words earliest, so throwing out the last 10%, 50%, even 90% may still leave you with a useful set of the most-common words. (You'd need to perform this trimming yourself by looking at the model object and source code.)
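As a hedged sketch of that trimming, assuming you have (or convert to) word2vec's plain-text format: since the file is ordered most-frequent-first, trimming is just rewriting the header's word count and copying the first N lines. (Recent gensim versions also accept a limit argument to load_word2vec_format that achieves the same at load time — check whether yours does.) The function name and paths here are made up:

```python
def trim_word2vec_text(src, dst, keep):
    """Keep only the first `keep` vectors of a text-format word2vec file.

    Works because GoogleNews-style files list words in descending
    frequency order, so the head of the file is the most useful part.
    """
    with open(src, encoding='utf8') as fin, \
         open(dst, 'w', encoding='utf8') as fout:
        count, dims = fin.readline().split()   # header line: "<count> <dims>"
        keep = min(keep, int(count))
        fout.write('%d %s\n' % (keep, dims))   # rewrite header with new count
        for _ in range(keep):
            fout.write(fin.readline())

# usage sketch (paths hypothetical):
# trim_word2vec_text('GoogleNews.txt', 'GoogleNews-top500k.txt', 500000)
```

The trimmed file loads with the same load_word2vec_format() call, just far less RAM.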

Finally, you can use nearest-neighbors indexing to get faster top-N matches, but at a cost of extra memory and approximate results (that may miss some of the true top-N matches). Recent gensim versions include an IPython notebook tutorial demonstrating this, annoytutorial.ipynb, in the gensim docs/notebooks directory.
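To get a feel for how such indexes trade accuracy for speed, here's a minimal pure-numpy sketch of the random-hyperplane idea Annoy builds on — this is not the real Annoy API (which grows a forest of such split-trees), just the core trick: bucket vectors by which side of a few random hyperplanes they fall on, then brute-force only within the query's bucket.

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy data: 1000 unit vectors in 20 dims standing in for word vectors.
vecs = rng.randn(1000, 20).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Index build: the sign pattern against 8 random hyperplanes assigns
# each vector to one of up to 2**8 buckets.
planes = rng.randn(8, 20)

def bucket_of(v):
    return tuple((planes @ v) > 0)

buckets = {}
for i, v in enumerate(vecs):
    buckets.setdefault(bucket_of(v), []).append(i)

def approx_most_similar(query, topn=5):
    # Score only candidates in the query's bucket (fall back to all).
    cand = np.fromiter(buckets.get(bucket_of(query), range(len(vecs))),
                       dtype=int)
    sims = vecs[cand] @ query
    order = np.argsort(-sims)[:topn]
    return [(int(cand[i]), float(sims[i])) for i in order]

hits = approx_most_similar(vecs[0])
```

Each query now scores only a small fraction of the vectors, but a true neighbor that landed on the other side of a hyperplane is silently missed — that's the "approximate" in approximate nearest neighbors, and why Annoy uses many trees to shrink that risk.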

gojomo answered Nov 17 '22