Memory error when using gensim for loading word2vec

Tags: python, word-embedding, gensim, word2vec, google-news

I am using the gensim library to load pre-trained word vectors from the GoogleNews dataset. This dataset contains 3,000,000 word vectors, each with 300 dimensions. When I try to load it, I get a memory error. I have run this code before without a memory error, and I don't know why I get one now. I have checked many sites for a solution, but I can't understand them. This is my code for loading GoogleNews:

import gensim.models.keyedvectors as word2vec
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)

And this is the error I received:

File "/home/mahsa/PycharmProjects/tensor_env_project/word_embedding_DUC2007/inspect_word2vec-master/word_embeddings_GoogleNews.py", line 8, in <module>
    model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
  File "/home/mahsa/anaconda3/envs/tensorflow_env/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 212, in load_word2vec_format
    result.syn0 = zeros((vocab_size, vector_size), dtype=datatype)
MemoryError

Can anybody help me? Thanks.

asked May 23 '18 00:05

Mahsa

2 Answers

Loading just the raw vectors will take...

3,000,000 words * 300 dimensions * 4 bytes/dimension = 3.6GB

...of addressable memory (plus some overhead for the word-key to index-position map).
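
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `raw_vector_bytes` is made up for illustration, and the default of 4 bytes per value assumes 32-bit floats, as in the calculation above. It does not count the overhead of the word-key to index-position map.

```python
def raw_vector_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Bytes needed for a dense vocab_size x vector_size array of values."""
    return vocab_size * vector_size * bytes_per_value


# GoogleNews: 3,000,000 words, 300 dimensions, 4 bytes per dimension.
total = raw_vector_bytes(3_000_000, 300)
print(total)                 # size in bytes
print(total / 10**9, "GB")   # decimal gigabytes
print(total / 2**30, "GiB")  # binary gibibytes, as many system monitors report
```

The GiB figure comes out somewhat lower than the GB figure, which is worth keeping in mind when comparing against what the operating system reports as free memory.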

Additionally, as soon as you do a most_similar()-type operation, unit-length-normalized versions of the vectors are created, which requires another 3.6GB. (If you will only ever do cosine-similarity comparisons between the unit-normed vectors, you can instead overwrite the raw vectors in place, saving that extra memory, by first calling model.init_sims(replace=True) explicitly.)
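
To make the memory trade-off concrete, here is a rough pure-Python sketch of what normalizing in place means. It does not use gensim and is not gensim's implementation. The helper names are hypothetical, and plain lists stand in for the real vector array. The point is that the raw values are overwritten, so no second array is allocated, and the dot product of two unit-length vectors is their cosine similarity.

```python
import math


def normalize_in_place(vectors):
    """Scale every vector to unit length, overwriting the original values."""
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            continue  # leave all-zero vectors untouched
        for i in range(len(vec)):
            vec[i] /= norm
    return vectors


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vectors = [[3.0, 4.0], [1.0, 0.0]]
normalize_in_place(vectors)          # the same list objects now hold unit vectors
print(vectors[0])                    # unit-length version of [3.0, 4.0]
print(dot(vectors[0], vectors[1]))   # cosine similarity of the two original vectors
```

Once the vectors are overwritten, the original magnitudes are gone, which is why this only suits the cosine-similarity-only case.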

So you'll generally only want to do full operations on a machine with at least 8GB of RAM. (Any swapping at all during full-array most_similar() lookups will make operations very slow.)

If anything else was using Python heap space at the time, that could account for the MemoryError you saw, even though the same code ran before.

The load_word2vec_format() method also has an optional limit argument, which loads only the supplied number of vectors, so you could use limit=500000 to cut the memory requirements by about 5/6. (And, since GoogleNews and other vector sets are usually ordered from most- to least-frequent words, you'll get the 500K most-frequent words. Lower-frequency words are generally much less useful and tend to have lower-quality vectors, so it may not hurt much to ignore them.)
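
One way to pick a value for limit is to work backwards from a memory budget. The sketch below is illustrative only: `limit_for_budget` and `load_limited` are hypothetical helper names, the defaults assume the GoogleNews shape (3,000,000 vectors of 300 four-byte values), and the budget covers only the raw array, not the word-key map or any normalized copy. The gensim import sits inside the function so the arithmetic can be tried without gensim installed.

```python
def limit_for_budget(budget_bytes, vector_size=300, bytes_per_value=4,
                     vocab_size=3_000_000):
    """Largest number of vectors whose raw array fits in budget_bytes."""
    per_vector = vector_size * bytes_per_value
    return min(vocab_size, budget_bytes // per_vector)


def load_limited(path, limit):
    """Load only the first `limit` vectors of a binary word2vec file."""
    from gensim.models import KeyedVectors  # imported lazily; requires gensim

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


print(limit_for_budget(600_000_000))  # 500000, the limit suggested above

# With gensim and the downloaded file available:
# vectors = load_limited("GoogleNews-vectors-negative300.bin",
#                        limit_for_budget(600_000_000))
```

A budget of 600 MB for the raw array gives the 500K most-frequent words, one sixth of the full set.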

answered Nov 14 '22 21:11

gojomo

To load the whole model, you need more RAM.

Alternatively, you can use the following code. Set limit to whatever your system can handle; only the vectors at the top of the file are loaded.

from gensim import models

w = models.KeyedVectors.load_word2vec_format(r"GoogleNews-vectors-negative300.bin.gz", binary=True, limit=100000)

I set the limit to 100,000, and it worked on my laptop with 4GB of RAM.
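
If you don't know in advance what your system can take, one possible approach is to try a few limits from largest to smallest and keep the first that loads. This is a sketch under assumptions: `load_with_fallback` and `gensim_loader` are hypothetical helpers, not part of gensim, and catching MemoryError is not guaranteed to work everywhere, since some systems swap heavily or kill the process instead of raising it.

```python
def load_with_fallback(load, limits):
    """Try each limit in order; return (limit, result) for the first that fits."""
    limits = list(limits)
    for limit in limits:
        try:
            return limit, load(limit)
        except MemoryError:
            continue  # not enough memory for this many vectors; try fewer
    raise MemoryError("none of the limits fit in memory: %r" % (limits,))


def gensim_loader(path):
    """Build a load(limit) function for a binary word2vec file."""
    def load(limit):
        from gensim.models import KeyedVectors  # imported lazily; requires gensim

        return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
    return load


# With gensim and the downloaded file available:
# limit, w = load_with_fallback(
#     gensim_loader(r"GoogleNews-vectors-negative300.bin.gz"),
#     [3000000, 1000000, 500000, 100000],
# )
```

Passing the loader in as a function keeps the retry logic separate from gensim, so it can be tried out with a stand-in loader first.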

answered Nov 14 '22 21:11

Soubhik Mazumdar
