
ValueError: array is too big when loading GoogleNews-vectors-negative

Tags:

python

gensim

I am trying to load the pretrained word vectors from Google using the following code:

from gensim import models
w = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

But I am getting an error that tells me

File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 197, in load_word2vec_format
    result.syn0 = zeros((vocab_size, vector_size), dtype=datatype)

ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.

Could anyone suggest a possible solution? Thanks in advance.

Winston asked Mar 10 '17 20:03



1 Answer

This is likely triggered because the Python you have installed is a 32-bit build, which can't allocate an array as large as the one needed to hold the GoogleNews vectors. Some options:

  • Switch to a 64-bit Python. Note that the full vector set takes 3GB+ of memory to load, so unless you have more than 4GB of RAM, the full set will be hard to work with no matter what.
  • Use the optional limit parameter of gensim's load_word2vec_format() method to read only the earliest entries in the file. The file seems to be ordered from most-frequent to least-frequent token, so the early entries are often all you'll need. For example, you could try limit=500000 to read just the first 500,000 entries (instead of all 3 million).
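
To check which situation you are in, here is a minimal sketch that reports whether the interpreter is a 32-bit or 64-bit build and estimates how much memory the vector array needs. The helper names (pointer_bits, required_bytes, load_google_news) are made up for illustration and are not part of gensim. The 300-dimension figure is assumed from the file name, and 4 bytes per value assumes gensim's default float32 datatype.

```python
import struct

FLOAT32_BYTES = 4  # assumption: vectors are stored as 32-bit floats


def pointer_bits():
    """Return the pointer width (in bits) of the running Python build."""
    return struct.calcsize("P") * 8


def required_bytes(vocab_size, vector_size, itemsize=FLOAT32_BYTES):
    """Bytes needed for one dense (vocab_size x vector_size) array."""
    return vocab_size * vector_size * itemsize


def load_google_news(path, limit=None):
    """Load the vectors, keeping only the first `limit` entries if given."""
    # Imported lazily so the size checks above work without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


full = required_bytes(3_000_000, 300)   # all 3 million entries
partial = required_bytes(500_000, 300)  # first 500,000 entries only
print(f"Python build: {pointer_bits()}-bit")
print(f"Full set:    {full / 1e9:.1f} GB")
print(f"limit=500000: {partial / 1e9:.1f} GB")

# Hypothetical usage, assuming the file is in the working directory:
# w = load_google_news('GoogleNews-vectors-negative300.bin.gz', limit=500000)
```

If this reports a 32-bit build, the full array cannot be allocated no matter how much RAM is installed, so either the limit parameter or a 64-bit Python is needed.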
gojomo answered Nov 10 '22 00:11