 

Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors

I'm training a Word2Vec model like:

from gensim.models import Word2Vec

# pre-4.0 gensim parameter names (size/iter were later renamed vector_size/epochs)
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)

and Doc2Vec model like:

from gensim.models.doc2vec import Doc2Vec

doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

with the same data and comparable parameters.

After this I'm using these models for my classification task, and I have found that simply averaging or summing the Word2Vec embeddings of a document performs considerably better than using the Doc2Vec vectors. I also tried many more Doc2Vec iterations (25, 80, and 150); it makes no difference.

Any tips or ideas why and how to improve doc2vec results?

Update: This is how doc2vec_tagged_documents is created:

from gensim.models.doc2vec import TaggedDocument

doc2vec_tagged_documents = [
    TaggedDocument(document, [i]) for i, document in enumerate(documents)
]

Some more facts about my data:

  • My training data contains 4000 documents, with 900 words on average.
  • My vocabulary size is about 1000 words.
  • The documents for my classification task are much shorter (12 words on average), but I also tried splitting the training data into lines and training the Doc2Vec model on those; the result is almost the same.
  • My data is not natural language; please keep this in mind.
asked Jul 21 '17 by ScientiaEtVeritas

1 Answer

Summing/averaging word2vec vectors is often quite good!
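
For reference, a minimal sketch of that averaging baseline, assuming the trained model and tokenized documents from the question:

import numpy as np

def average_vector(tokens, model):
    # average the vectors of in-vocabulary tokens; fall back to a
    # zero vector for documents with no known words
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

doc_vectors = [average_vector(document, model) for document in documents]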

It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)

If your main interest is the doc-vectors, and not the word-vectors that some Doc2Vec modes co-train, definitely try the PV-DBOW mode (dm=0) as well. It trains faster and is often a top performer.
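
A sketch of that PV-DBOW setup, keeping the question's other parameters (pre-4.0 gensim API):

from gensim.models.doc2vec import Doc2Vec

# dm=0 selects PV-DBOW; window is irrelevant here because dbow_words=0
# (the default) skips skip-gram word-vector training entirely
dbow_model = Doc2Vec(size=200, min_count=0, workers=4, iter=20, dm=0)
dbow_model.build_vocab(doc2vec_tagged_documents)
dbow_model.train(doc2vec_tagged_documents, total_examples=dbow_model.corpus_count, epochs=dbow_model.iter)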

If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by the word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning, unless you have a larger collection of longer docs.

Other things that sometimes help improve Doc2Vec vectors for classification purposes:

  • re-inferring all document vectors at the end of training, perhaps even using parameters different from the infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025). While quite slow, this means all docs get vectors from the same final model state, rather than whatever is left over from bulk training (see the sketch after this list)

  • where classification labels are known, adding them as trained doc-tags, using the ability of TaggedDocument tags to be a list of tags (also sketched below)

  • rare words are essentially just noise to Word2Vec or Doc2Vec, so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods, when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)
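
Minimal sketches of the first two bullets, assuming the documents and doc2vec_model from the question, plus a hypothetical labels list parallel to documents:

# re-infer every doc-vector from the same final model state
inferred_vectors = [doc2vec_model.infer_vector(document, steps=50, alpha=0.025) for document in documents]

# add a known class label (labels is an assumed list, parallel to
# documents) as an extra doc-tag alongside the unique integer ID
doc2vec_tagged_documents = [TaggedDocument(document, [i, labels[i]]) for i, document in enumerate(documents)]

The third bullet is just a parameter change: raise min_count (for example, min_count=5) in the Doc2Vec constructor before training.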

Hope this helps.

answered Sep 19 '22 by gojomo