I'm fairly new to this and not a native English speaker, so I'm having some trouble understanding <code>Gensim</code>'s <code>word2vec</code> and <code>doc2vec</code>. As far as I can tell, after training both give me the words most similar to a query word via <code>most_similar()</code>. How can I tell when I should use <code>word2vec</code> and when <code>doc2vec</code>? Could someone explain the difference briefly, please? Thanks.

Gensim: What is the difference between word2vec and doc2vec?

1 Answer

In word2vec, you train to find word vectors and then run similarity queries between words. In doc2vec, you also tag your texts, and training gives you tag vectors in addition. For instance, say you have documents from different authors and use the authors as tags on the documents. After doc2vec training you can use the same vector arithmetic to run similarity queries on the author tags, e.g. who are the most similar authors to AUTHOR_X? If two authors generally use the same words, their vectors will be closer. AUTHOR_X is not a real word in your corpus, just a label you choose, so it does not need to appear in your text and you don't have to insert it manually. Gensim allows you to train doc2vec with or without word vectors (e.g. if you only care about similarities between tags).

Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).

If you tell me what problem you are trying to solve, maybe I can suggest which method is more appropriate.

answered Oct 12 '22 11:10

pembeci

Tags:

nlp

gensim

user3595632
