I'm fairly new to this (and not a native English speaker), so I'm having some trouble understanding Gensim's word2vec and doc2vec.
As I understand it, both give me the words most similar to a query word via most_similar() (after training). How can I tell in which cases I should use word2vec and in which doc2vec? Could someone explain the difference in a few words, please?
Thanks.
Doc2Vec is another widely used technique that creates an embedding of a document irrespective of its length. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.
One study reported that Word2Vec (CBOW) achieved better accuracy than several Doc2Vec models. Similar results were reported in [39], which compared Word2Vec and Doc2Vec performance on the supervised learning of text categories, applying both algorithms to the Reuters-21578 data.
The Doc2Vec model, as opposed to the Word2Vec model, is used to create a vectorised representation of a group of words taken collectively as a single unit. It is not just the simple average of the word vectors in the sentence.
To sum up the theory behind Doc2Vec: it is a model for producing vector representations of paragraphs or whole text documents, built on the same ideas as word embeddings.
In word2vec, you train to find word vectors and then run similarity queries between words. In doc2vec, you tag your texts, and you also get tag vectors. For instance, if you have different documents from different authors, you can use the authors as tags on the documents. Then, after doc2vec training, you can use the same vector arithmetic to run similarity queries on author tags, e.g., who are the most similar authors to AUTHOR_X? If two authors generally use the same words, their vectors will be closer. AUTHOR_X is not a real word that is part of your corpus, just a label you choose, so you don't need to have it in (or manually insert it into) your text. Gensim allows you to train doc2vec with or without word vectors (i.e., if you only care about how the tags compare to each other).
Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).
If you tell me what problem you are trying to solve, maybe I can suggest which method would be more appropriate.