I'm training a Word2Vec model like:
from gensim.models import Word2Vec
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and a Doc2Vec model like:
from gensim.models import Doc2Vec
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using these models for my classification task, and I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150 - it makes no difference).
Any tips or ideas why, and how to improve the doc2vec results?
Update: This is how doc2vec_tagged_documents is created:
from gensim.models.doc2vec import TaggedDocument

doc2vec_tagged_documents = list()
for counter, document in enumerate(documents):
    doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
Some more facts about my data:
I've also tried training the doc2vec model in other ways, but it's almost the same result.
Summing/averaging word2vec vectors is often quite good!
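For what it's worth, that baseline can be as simple as the following sketch (the helper name is just for illustration; it assumes a pre-4.0 gensim model like yours, where word-vectors are reached via model.wv and vocabulary membership via model.wv.vocab):

import numpy as np

def average_document_vector(model, tokens):
    # Average the word-vectors of all in-vocabulary tokens of one document;
    # fall back to a zero vector for documents with no known tokens.
    vectors = [model.wv[token] for token in tokens if token in model.wv.vocab]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

X = np.array([average_document_vector(model, document) for document in documents])
# X can then be passed to any ordinary classifier together with your labels.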
It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)
If your main interest is the doc-vectors, and not the word-vectors that are co-trained in some Doc2Vec modes, definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top performer.
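For example, a PV-DBOW model with otherwise-comparable parameters might look like this (a sketch reusing the older-gensim parameter names from your code; current gensim releases call them vector_size and epochs):

from gensim.models import Doc2Vec

# dm=0 selects PV-DBOW; add dbow_words=1 if you also want word-vectors co-trained
# (window only matters when word-vectors are being trained).
doc2vec_model = Doc2Vec(dm=0, size=200, window=5, min_count=0, iter=20, workers=4)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)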
If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by the word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning, unless you have a larger collection of longer docs.
Other things that sometimes help improve Doc2Vec vectors for classification purposes:
Re-inferring all document vectors at the end of training, perhaps even using parameters different from the infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025). While quite slow, this means all docs get vectors from the same final model state, rather than whatever is left over from bulk training. (See the sketch below.)
Where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags. (Also shown in the sketch below.)
Rare words are essentially just noise to Word2Vec or Doc2Vec, so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)
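A sketch of the re-inference and label-tag ideas above (gensim 3.x-era API in which infer_vector() takes steps; train_labels is a hypothetical list of known class labels, one per document; string tags are used throughout to keep the tag handling uniform):

from gensim.models.doc2vec import TaggedDocument

# Re-infer a vector for every training document from the final model state,
# so all doc-vectors reflect the same fully-trained model.
reinferred_vectors = [doc2vec_model.infer_vector(doc.words, steps=50, alpha=0.025)
                      for doc in doc2vec_tagged_documents]

# Where class labels are known, add them as extra doc-tags: the tags of a
# TaggedDocument may be a list, so each document can carry both its own ID
# and its label.
doc2vec_tagged_documents = [
    TaggedDocument(document, [str(i), train_labels[i]])
    for i, document in enumerate(documents)
]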
Hope this helps.