 

memory error when using gensim for loading word2vec

I am using the gensim library to load pre-trained word vectors from the GoogleNews dataset. This dataset contains 3,000,000 word vectors, each of 300 dimensions. When I try to load the GoogleNews file, I receive a memory error. I have run this code before without a memory error and I don't know why I receive it now. I have checked a lot of sites to solve this issue but I can't work it out. This is my code for loading GoogleNews:

import gensim.models.keyedvectors as word2vec
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)

and this is the error I received:

File "/home/mahsa/PycharmProjects/tensor_env_project/word_embedding_DUC2007/inspect_word2vec-master/word_embeddings_GoogleNews.py", line 8, in <module>
    model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
  File "/home/mahsa/anaconda3/envs/tensorflow_env/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 212, in load_word2vec_format
    result.syn0 = zeros((vocab_size, vector_size), dtype=datatype)
MemoryError

Can anybody help me? Thanks.

asked May 23 '18 by Mahsa


People also ask

Is Gensim Word2Vec CBOW or skip gram?

The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling: Tomas Mikolov et al: Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al: Distributed Representations of Words and Phrases and their Compositionality.

How does Gensim Word2Vec work?

Word2Vec is a widely used word representation technique that uses neural networks under the hood. The resulting word representation or embeddings can be used to infer semantic similarity between words and phrases, expand queries, surface related concepts and more.

What is Min_count in Word2Vec?

min_count: The minimum count of words to consider when training the model; words occurring fewer times than this are ignored. The default for min_count is 5. workers: The number of worker threads used during training; the default is 3.

How long does Word2Vec take to train?

Training a Word2Vec model takes about 22 hours, and a FastText model takes about 33 hours. If that is too long for you, you can use a smaller iter (fewer training epochs), but performance might be worse.
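
For context on those parameters, here is a minimal training sketch; the toy corpus is invented purely for illustration, and the keyword names size and iter follow gensim 3.x (the version in this question), which gensim 4.x renamed to vector_size and epochs:

from gensim.models import Word2Vec

# Tiny invented corpus just for illustration; real training needs far more text.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "barked", "at", "the", "cat"]]

# min_count=1 so the toy words are not discarded (the default is 5);
# workers sets the number of training threads; iter is the number of epochs.
model = Word2Vec(sentences, size=50, min_count=1, workers=3, iter=5)

print(model.wv.most_similar("cat", topn=2))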


2 Answers

Loading just the raw vectors will take...

3,000,000 words * 300 dimensions * 4 bytes/dimension = 3.6GB

...of addressable memory (plus some overhead for the word-key to index-position map).

Additionally, as soon as you want to do a most_similar()-type operation, unit-length normalized versions of the vectors will be created, which will require another 3.6GB. (If you'll only be doing cosine-similarity comparisons between the unit-normed vectors, you can instead clobber the raw vectors in place, saving that extra memory, by first making a forced, explicit call to model.init_sims(replace=True).)
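
As a rough sketch of that in-place normalization, assuming a gensim 3.x release (where init_sims() is still the documented call) and the .bin file sitting in the working directory:

from gensim.models import KeyedVectors

# Load the raw float32 vectors (~3.6GB of addressable memory for GoogleNews).
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Replace the raw vectors with their unit-length versions in place, so a later
# most_similar() does not allocate a second ~3.6GB normalized array.
# (gensim 4.x deprecates init_sims() and handles normalization differently.)
model.init_sims(replace=True)

print(model.most_similar("king", topn=3))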

So you'll generally only want to do full operations on a machine with at least 8GB of RAM. (Any swapping at all during full-array most_similar() lookups will make operations very slow.)

If anything else was using Python heap space, that could have accounted for the MemoryError you saw.

The load_word2vec_format() method also has an optional limit argument which will only load the supplied number of vectors – so you could use limit=500000 to cut the memory requirements by about 5/6ths. (And, since the GoogleNews and other vector sets are usually ordered from most- to least-frequent words, you'll get the 500K most-frequent words. Lower-frequency words generally have much less value and even not-as-good vectors, so it may not hurt much to ignore them.)
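
A minimal sketch of that limit approach, under the same assumption about the file's location (limit=500000 as in the paragraph above; vocab is the gensim 3.x attribute, renamed key_to_index in 4.x):

from gensim.models import KeyedVectors

# Load only the 500,000 most frequent words: roughly 500,000 * 300 * 4 bytes
# = 0.6GB of raw vectors instead of 3.6GB for the full file.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=500000)

print(len(model.vocab))  # 500000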

answered Nov 14 '22 by gojomo


Loading the whole model requires more RAM.

You can use the following code. Set limit to whatever your system can handle; it will load that many vectors from the top of the file.

from gensim import models

w = models.KeyedVectors.load_word2vec_format(r"GoogleNews-vectors-negative300.bin.gz", binary=True, limit=100000)

I set the limit as 100,000. It worked on my 4GB RAM laptop.
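
As a quick sanity check on that truncated load (assuming a common word such as "computer" falls within the top 100,000 entries, which it does since the GoogleNews file is ordered by frequency):

print(len(w.vocab))                        # 100000 (gensim 3.x attribute)
print(w.most_similar("computer", topn=5))  # similarity queries still work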

answered Nov 14 '22 by Soubhik Mazumdar