The code is in Python. I loaded the binary model into gensim and used the "init_sims" option to make execution faster. The OS is OS X. It takes almost 50-60 seconds to load the model, and an equivalent time to run "most_similar". Is this normal? Before using the init_sims option, it took almost double the time! I have a feeling it might be an OS RAM-allocation issue.
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)
model.save('SmallerFile')
# MODEL SAVED INTO SMALLERFILE; NEXT, LOAD FROM IT
model = Word2Vec.load('SmallerFile', mmap='r')
# GIVE RESULT HERE!
print model.most_similar(positive=['woman', 'king'], negative=['man'])
Note that the memory-saving effect of init_sims(replace=True) doesn't persist across save/load cycles, because saving always saves the 'raw' vectors (from which the unit-normalized vectors can be recalculated). So, even after your re-load, when you call most_similar() for the 1st time, init_sims() will be called behind the scenes, and the memory usage will be doubled.
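A quick way to observe this behavior (a sketch assuming the same old-style gensim Word2Vec API as above; syn0norm is an internal attribute whose name may differ across gensim versions):

model = Word2Vec.load('SmallerFile', mmap='r')
print model.syn0norm is None            # True: unit-normalized cache not built yet
model.most_similar('king')              # triggers init_sims() behind the scenes
print model.syn0norm is model.syn0      # False: a second full-size array now exists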
And the GoogleNews dataset is quite large, taking 3+ GB to load even before the unit-normalization possibly doubles the memory usage. So, depending on what else you've got running and the machine's RAM, you might be using swap memory by the time the most_similar() calculations are running – which is very slow for the calculate-against-every-vector-and-sort-results similarity ops. (Still, any most_similar() checks after the 1st won't need to re-fill the unit-normalized vector cache, so should go faster than the 1st call.)
Given that you've saved the model after init_sims(replace=True), its raw vectors are already unit-normalized. So you can manually patch the model to skip the recalculation, just after your load():
model.syn0norm = model.syn0
Then even your first most_similar() will just consult the (single, memory-mapped) set of vectors, without triggering an init_sims().
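Putting it together, the whole memory-mapped reload becomes (a sketch using the same old-style API as above):

model = Word2Vec.load('SmallerFile', mmap='r')
model.syn0norm = model.syn0  # vectors were saved already unit-normalized; skip recalculation
print model.most_similar(positive=['woman', 'king'], negative=['man'])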
If that's still too slow, you may need more memory or to trim the vectors to a subset. The GoogleNews vectors seem to be sorted to put the most-frequent words earliest, so throwing out the last 10%, 50%, even 90% may still leave you with a useful set of the most-common words. (You'd need to perform this trimming yourself by looking at the model object and source code.)
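For example, a rough trim of a loaded model might look like this (a sketch only; syn0, index2word, and vocab are internals of the old gensim Word2Vec class, and KEEP is a hypothetical cutoff; verify the attribute names against your version's source):

KEEP = 500000  # keep only the 500k most-frequent words
model.syn0 = model.syn0[:KEEP]
model.index2word = model.index2word[:KEEP]
model.vocab = {w: v for w, v in model.vocab.iteritems() if v.index < KEEP}
model.syn0norm = None  # force unit-normalization to run over the smaller array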
Finally, you can use nearest-neighbor indexing to get faster top-N matches, at a cost of extra memory and approximate results (which may miss some of the true top-N matches). Recent gensim versions include an IPython notebook tutorial demonstrating this, annoytutorial.ipynb, in the gensim docs/notebooks directory.
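A minimal sketch of that approach (assuming the AnnoyIndexer wrapper shipped with recent gensim versions, plus the separate annoy package installed):

from gensim.similarities.index import AnnoyIndexer

indexer = AnnoyIndexer(model, 100)  # second arg is num_trees: more trees, better accuracy but more memory and build time
print model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5, indexer=indexer)

Building the Annoy index costs time up front, but each query afterwards avoids the full scan-and-sort over every vector.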