
Gensim Doc2Vec generating huge file for model [closed]

I am trying to use the Doc2Vec implementation from the gensim package. My problem is that when I train and save the model, the resulting model file is rather large (2.5 GB). I tried using this line:

model.estimate_memory()

But it didn't change anything. I have also tried changing max_vocab_size to reduce the space, but no luck. Can somebody help me with this?
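For context, a typical Doc2Vec train-and-save workflow looks roughly like the sketch below (the corpus file name and all parameter values are illustrative placeholders, not the actual code in question). Note that estimate_memory() only reports projected sizes; it does not make the model smaller.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: one TaggedDocument per line of text, tagged with an integer id
documents = [TaggedDocument(words=line.split(), tags=[i])
             for i, line in enumerate(open('corpus.txt'))]

model = Doc2Vec(vector_size=300, min_count=5, workers=4)
model.build_vocab(documents)

# Returns a dict of estimated sizes per component -- purely informational
print(model.estimate_memory())

model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vec.model')  # this saved file is what grows to gigabytes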

asked Jul 19 '17 by ida



1 Answer

Doc2Vec models can be large. In particular, any word-vectors in use will take 4 bytes per dimension, times two layers of the model. So a 300-dimension model with a 200,000-word vocabulary will use, just for the word-vector arrays themselves:

200,000 vectors * 300 dimensions * 4 bytes/float * 2 layers = 480MB

(There will be additional overhead for the dictionary storing vocabulary information.)

Any doc-vectors will also use 4 bytes per dimension. So if you train vectors for a million doc-tags, the model will use, just for the doc-vectors array:

1,000,000 vectors * 300 dimensions * 4 bytes/float = 1.2GB

(If you're using arbitrary string tags to name the doc-vectors, there'll be additional overhead for that.)
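Both figures can be reproduced with plain arithmetic; the short check below just restates the formulas above in Python (no gensim required):

def array_bytes(n_vectors, dims, layers=1, bytes_per_float=4):
    """Raw size of a float32 vector array: n_vectors * dims * 4 bytes, per layer."""
    return n_vectors * dims * bytes_per_float * layers

# Word-vectors: 200,000 words x 300 dims, two layers of the model
print(array_bytes(200_000, 300, layers=2) / 1e6, "MB")  # -> 480.0 MB

# Doc-vectors: 1,000,000 doc-tags x 300 dims, one array
print(array_bytes(1_000_000, 300) / 1e9, "GB")  # -> 1.2 GB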

To use less memory when loaded (which will also result in a smaller store file), you can use a smaller vocabulary, train fewer doc-vecs, or use a smaller vector size.
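As a rough illustration (parameter values here are arbitrary examples, not recommendations), each of those levers corresponds to a Doc2Vec constructor argument; max_final_vocab is only available in more recent gensim releases:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=100,         # smaller vectors: fewer bytes per word- and doc-vector
    min_count=10,            # drop rare words: smaller vocabulary
    max_final_vocab=50_000,  # cap the surviving vocabulary size
    workers=4,
)

With 100 dimensions and a 50,000-word vocabulary, the same arithmetic as above gives about 40MB for the word-vector arrays instead of 480MB.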

If you'll only need the model for certain narrow purposes, there may be other parts you can throw out after training – but that requires knowledge of the model internals/source-code, and your specific needs, and will result in a model that's broken (and likely to throw errors) for many other usual operations.
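For example, older gensim releases (the 3.x line that was current when this question was asked) shipped a helper for exactly this kind of post-training slimming; it was removed in gensim 4.x, so treat the lines below as a version-specific sketch rather than current API:

# gensim 3.x only: discard arrays needed only during training.
# keep_doctags_vectors=False would also drop the stored per-document vectors;
# keep_inference=False would additionally drop the weights used by infer_vector().
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
model.save('doc2vec_slim.model')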

answered Oct 11 '22 by gojomo