I am using the doc2vec model from the gensim framework to represent a corpus of 15 500 000 short documents (up to 300 words each):
gensim.models.Doc2Vec(sentences, size=400, window=10, min_count=1, workers=8)
After training, there are more than 18 000 000 vectors representing words and documents.
I want to find the most similar items (words or documents) for a given item:
similarities = model.most_similar('uid_10693076')
but I get a MemoryError when the similarities are computed:
Traceback (most recent call last):
File "article/test_vectors.py", line 31, in <module>
similarities = model.most_similar(item)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 639, in most_similar
self.init_sims()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 827, in init_sims
self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
I have an Ubuntu machine with 60GB of RAM and 70GB of swap. I checked the memory allocation (in htop) and observed that the memory was never completely used. I also set the maximum address space that may be locked in memory in Python to unlimited:
resource.getrlimit(resource.RLIMIT_MEMLOCK)
Could someone explain the reason for this MemoryError? In my opinion the available memory should be enough for these computations. Could there be some memory limit in Python or the OS?
Thanks in advance!
18M vectors * 400 dimensions * 4 bytes/float = 28.8GB for the model's syn0 array (trained vectors)
The syn1 array (hidden weights) will also be 28.8GB – even though syn1 doesn't really need entries for doc-vectors, which are never target-predictions during training.
The vocabulary structures (vocab dict and index2word table) will likely add another GB or more. So that's all your 60GB RAM.
The syn0norm array, used for similarity calculations, will need another 28.8GB, for a total usage of around 90GB. It's the syn0norm creation where you're getting the error. But even if syn0norm creation succeeded, being that deep into virtual memory would likely ruin performance.
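For reference, here is the same arithmetic as a few lines of plain Python (a back-of-the-envelope sketch, not gensim code; the 18M-vector count is taken from your model):

n_vectors = 18000000        # words + documents
dims = 400                  # vector size
bytes_per_float = 4         # float32, gensim's REAL type
one_array_gb = n_vectors * dims * bytes_per_float / 1e9
print("syn0 (trained vectors):    %.1f GB" % one_array_gb)
print("syn1 (hidden weights):     %.1f GB" % one_array_gb)
print("syn0norm (for similarity): %.1f GB" % one_array_gb)
print("total:                     %.1f GB" % (3 * one_array_gb))  # ~86 GB, plus a GB or more of vocab structures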
Some steps that might help:
Use a min_count of at least 2: words appearing once are unlikely to contribute much, but likely use a lot of memory. (But since words are a tiny portion of your syn0, this will only save a little.)
After training but before triggering init_sims(), discard the syn1 array. You won't be able to train more, but your existing word/doc vectors remain accessible.
After training but before calling most_similar(), call init_sims() yourself with a replace=True parameter, to discard the non-normalized syn0 and replace it with syn0norm. Again you won't be able to train more, but you'll save the syn0 memory. (This step and the previous one are shown in the sketch after this list.)
In-progress work separating out the doc and word vectors, which will appear in gensim past version 0.11.1, should also eventually offer some relief. (It'll shrink syn1 to only include word entries, and allow doc-vectors to come from a file-backed (memmap'd) array.)
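Putting the trimming steps together, here's a minimal sketch against the pre-0.12 gensim API you're using (the syn1/syn1neg attribute names are internals of that era, so treat this as an assumption rather than a guaranteed recipe):

import gensim

# 'sentences' is your corpus iterator from the question; min_count raised to 2
model = gensim.models.Doc2Vec(sentences, size=400, window=10,
                              min_count=2, workers=8)

# Training is finished: drop the hidden-layer weights (syn1 with hierarchical
# softmax, syn1neg with negative sampling). No further training is possible,
# but the word/doc vectors in syn0 remain usable.
if hasattr(model, 'syn1'):
    del model.syn1
if hasattr(model, 'syn1neg'):
    del model.syn1neg

# Normalize in place so syn0norm replaces syn0 instead of being an extra copy
model.init_sims(replace=True)

similarities = model.most_similar('uid_10693076')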