
Gensim Doc2Vec generating huge file for model [closed]

I am trying to use the Doc2Vec implementation from the gensim package. My problem is that when I train and save the model, the resulting model file is rather large (2.5 GB). I tried using this line:

model.estimate_memory()

But it didn't change anything. I have also tried changing max_vocab_size to reduce the space, but no luck. Can somebody help me with this?
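For context, a typical Doc2Vec train-and-save workflow looks roughly like the sketch below (the corpus file name and all parameter values are illustrative placeholders, not the actual code in question). Note that estimate_memory() only reports projected sizes; it does not make the model smaller.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: one TaggedDocument per line of text, tagged with an integer id
documents = [TaggedDocument(words=line.split(), tags=[i])
             for i, line in enumerate(open('corpus.txt'))]

model = Doc2Vec(vector_size=300, min_count=5, workers=4)
model.build_vocab(documents)

# Returns a dict of estimated sizes per component -- purely informational
print(model.estimate_memory())

model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vec.model')  # this saved file is what grows to gigabytes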

asked Jul 19 '17 by ida



1 Answer

Doc2Vec models can be large. In particular, any word-vectors in use will take 4 bytes per dimension, times two layers of the model. So a 300-dimension model with a 200,000-word vocabulary will use, just for the word-vector arrays themselves:

200,000 vectors * 300 dimensions * 4 bytes/float * 2 layers = 480MB

(There will be additional overhead for the dictionary storing vocabulary information.)

Any doc-vectors will also use 4 bytes per dimension. So if you train vectors for a million doc-tags, the model will use, just for the doc-vectors array:

1,000,000 vectors * 300 dimensions * 4 bytes/float = 1.2GB

(If you're using arbitrary string tags to name the doc-vectors, there'll be additional overhead for that.)
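Both figures can be reproduced with plain arithmetic; the short check below just restates the formulas above in Python (no gensim required):

def array_bytes(n_vectors, dims, layers=1, bytes_per_float=4):
    """Raw size of a float32 vector array: n_vectors * dims * 4 bytes, per layer."""
    return n_vectors * dims * bytes_per_float * layers

# Word-vectors: 200,000 words x 300 dims, two layers of the model
print(array_bytes(200_000, 300, layers=2) / 1e6, "MB")  # -> 480.0 MB

# Doc-vectors: 1,000,000 doc-tags x 300 dims, one array
print(array_bytes(1_000_000, 300) / 1e9, "GB")  # -> 1.2 GB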

To use less memory when loaded (which will also result in a smaller store file), you can use a smaller vocabulary, train fewer doc-vecs, or use a smaller vector size.
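As a rough illustration (parameter values here are arbitrary examples, not recommendations), each of those levers corresponds to a Doc2Vec constructor argument; max_final_vocab is only available in more recent gensim releases:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=100,         # smaller vectors: fewer bytes per word- and doc-vector
    min_count=10,            # drop rare words: smaller vocabulary
    max_final_vocab=50_000,  # cap the surviving vocabulary size
    workers=4,
)

With 100 dimensions and a 50,000-word vocabulary, the same arithmetic as above gives about 40MB for the word-vector arrays instead of 480MB.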

If you'll only need the model for certain narrow purposes, there may be other parts you can throw out after training – but that requires knowledge of the model internals/source-code, and your specific needs, and will result in a model that's broken (and likely to throw errors) for many other usual operations.
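For example, older gensim releases (the 3.x line that was current when this question was asked) shipped a helper for exactly this kind of post-training slimming; it was removed in gensim 4.x, so treat the lines below as a version-specific sketch rather than current API:

# gensim 3.x only: discard arrays needed only during training.
# keep_doctags_vectors=False would also drop the stored per-document vectors;
# keep_inference=False would additionally drop the weights used by infer_vector().
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
model.save('doc2vec_slim.model')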

answered Oct 11 '22 by gojomo