
Gensim: What is difference between word2vec and doc2vec?

Tags: nlp, gensim

I'm fairly new to this and not a native English speaker, so I'm having some trouble understanding Gensim's word2vec and doc2vec.

As far as I can tell, after training both give me the words most similar to a query word via most_similar().

How can I tell in which cases I should use word2vec and in which doc2vec?

Could someone briefly explain the difference, please?

Thanks.

asked Mar 16 '17 06:03 by user3595632



1 Answer

In word2vec, you train to find word vectors and then run similarity queries between words. In doc2vec, you tag your texts and you also get tag vectors. For instance, say you have documents from different authors and use the authors as tags on the documents. After doc2vec training you can use the same vector arithmetic to run similarity queries on the author tags, e.g. who are the most similar authors to AUTHOR_X? If two authors generally use the same words, their vectors will be closer. AUTHOR_X is not a real word in your corpus, just a label you choose, so it doesn't need to appear in your text and you don't have to insert it manually. Gensim lets you train doc2vec with or without word vectors (the latter is enough if you only care about similarities between tags).

Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).

If you tell me what problem you are trying to solve, maybe I can suggest which method is more appropriate.

answered Oct 12 '22 11:10 by pembeci