I'm training a Word2Vec model like:
from gensim.models import Word2Vec
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and a Doc2Vec model like:
from gensim.models import Doc2Vec
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using these models for my classification task, and I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150 - it makes no difference).
Any tips or ideas why, and how to improve the doc2vec results?
Update: This is how doc2vec_tagged_documents is created:
from gensim.models.doc2vec import TaggedDocument

doc2vec_tagged_documents = list()
for counter, document in enumerate(documents):
    doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
Some more facts about my data:
I've also tried training the doc2vec model in other ways, but it's almost the same result.
Summing/averaging word2vec vectors is often quite good!
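For what it's worth, that baseline can be as simple as the following sketch (the helper name is just for illustration; it assumes a pre-4.0 gensim model like yours, where word-vectors are reached via model.wv and vocabulary membership via model.wv.vocab):

import numpy as np

def average_document_vector(model, tokens):
    # Average the word-vectors of all in-vocabulary tokens of one document;
    # fall back to a zero vector for documents with no known tokens.
    vectors = [model.wv[token] for token in tokens if token in model.wv.vocab]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

X = np.array([average_document_vector(model, document) for document in documents])
# X can then be passed to any ordinary classifier together with your labels.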
It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)
If your main interest is the doc-vectors, and not the word-vectors that are co-trained in some Doc2Vec modes, definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top performer.
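For example, a PV-DBOW model with otherwise-comparable parameters might look like this (a sketch reusing the older-gensim parameter names from your code; current gensim releases call them vector_size and epochs):

from gensim.models import Doc2Vec

# dm=0 selects PV-DBOW; add dbow_words=1 if you also want word-vectors co-trained
# (window only matters when word-vectors are being trained).
doc2vec_model = Doc2Vec(dm=0, size=200, window=5, min_count=0, iter=20, workers=4)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)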
If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by the word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning, unless you have a larger collection of longer docs.
Other things that sometimes help improve Doc2Vec vectors for classification purposes:
Re-inferring all document vectors at the end of training, perhaps even using parameters different from the infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025). While quite slow, this means all docs get vectors from the same final model state, rather than whatever is left over from bulk training. (See the sketch below.)
Where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags. (Also shown in the sketch below.)
Rare words are essentially just noise to Word2Vec or Doc2Vec, so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)
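A sketch of the re-inference and label-tag ideas above (gensim 3.x-era API in which infer_vector() takes steps; train_labels is a hypothetical list of known class labels, one per document; string tags are used throughout to keep the tag handling uniform):

from gensim.models.doc2vec import TaggedDocument

# Re-infer a vector for every training document from the final model state,
# so all doc-vectors reflect the same fully-trained model.
reinferred_vectors = [doc2vec_model.infer_vector(doc.words, steps=50, alpha=0.025)
                      for doc in doc2vec_tagged_documents]

# Where class labels are known, add them as extra doc-tags: the tags of a
# TaggedDocument may be a list, so each document can carry both its own ID
# and its label.
doc2vec_tagged_documents = [
    TaggedDocument(document, [str(i), train_labels[i]])
    for i, document in enumerate(documents)
]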
Hope this helps.