I read this page, but I do not understand the difference between the models built by the following code. I know that when dbow_words is 0, training of doc-vectors is faster.
First model
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)
Second model
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4, dbow_words=1)
A Doc2Vec model, as opposed to a Word2Vec model, is used to create a vectorised representation of a group of words taken collectively as a single unit. It is not merely the simple average of the word-vectors in the sentence.
A size of 100 means the vector representing each document will contain 100 elements - 100 values. The vector maps the document to a point in 100-dimensional space. A size of 200 would map a document to a point in 200-dimensional space. The more dimensions, the more capacity there is to differentiate between documents.
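As a minimal sketch (assuming gensim 4.x, where the constructor parameter is vector_size rather than the older size used in the question), you can verify the dimensionality directly:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; each document gets an integer tag.
docs = [
    TaggedDocument(words=["machine", "learning", "is", "fun"], tags=[0]),
    TaggedDocument(words=["deep", "learning", "uses", "neural", "networks"], tags=[1]),
]

model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, workers=4)

print(len(model.dv[0]))  # 100 -- the first document is a point in 100-dimensional space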
Paragraph Vector (more popularly known as Doc2Vec) provides two document-embedding models: Distributed Memory (PV-DM) and Distributed Bag Of Words (PV-DBOW).
The dbow_words parameter only has an effect when training a DBOW model - that is, with the non-default dm=0 parameter.
So, between your two example lines of code, which both leave the default dm=1 value unchanged, there's no difference.
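To make that concrete, here is a sketch (assuming gensim 4.x parameter names, with documents1 being your tagged corpus as in the question) of where dbow_words does and does not matter:

from gensim.models.doc2vec import Doc2Vec

# Equivalent to both calls in the question: dm=1 (PV-DM) is the default,
# so dbow_words is ignored whether it is 0 or 1.
pv_dm = Doc2Vec(documents1, vector_size=100, window=300, min_count=10, workers=4, dm=1)

# dbow_words only takes effect once dm=0 selects PV-DBOW training:
pv_dbow = Doc2Vec(documents1, vector_size=100, min_count=10, workers=4,
                  dm=0, dbow_words=0)            # pure PV-DBOW
pv_dbow_sg = Doc2Vec(documents1, vector_size=100, window=5, min_count=10, workers=4,
                     dm=0, dbow_words=1)         # PV-DBOW plus skip-gram word-training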
If you instead switch to DBOW training, dm=0, then with the default dbow_words=0 setting, the model is pure PV-DBOW as described in the original 'Paragraph Vectors' paper. Doc-vectors are trained to be predictive of each text example's words, but no word-vectors are trained. (There'll still be some randomly-initialized word-vectors in the model, but they're not used or improved during training.) This mode is fast and still works pretty well.
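In that pure PV-DBOW mode, you would typically only consult the doc-vectors - a sketch, again assuming gensim 4.x and an iterable documents1 of TaggedDocument objects:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(documents1, vector_size=100, min_count=10, workers=4,
                dm=0, dbow_words=0)          # pure PV-DBOW

doc_vec = model.dv[0]                        # trained doc-vector for the first tag
similar_docs = model.dv.most_similar([doc_vec])  # meaningful: doc-vectors were trained

# model.wv still holds randomly-initialized word-vectors, but since they were
# never trained in this mode, similarity queries on them are not meaningful.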
If you add the dbow_words=1 setting, then skip-gram word-vector training will be added to the training, in an interleaved fashion. (For each text example, both doc-vectors over the whole text, then word-vectors over each sliding context window, will be trained.) Since this adds more training examples, as a function of the window parameter, it will be significantly slower. (For example, with window=5, adding word-training will make training about 5x slower.)
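A rough way to see this cost (a sketch, not a proper benchmark; gensim 4.x names and a documents1 corpus are assumed) is to time the two DBOW variants side by side:

import time
from gensim.models.doc2vec import Doc2Vec

def timed_train(**kwargs):
    start = time.time()
    Doc2Vec(documents1, vector_size=100, min_count=10, workers=4, **kwargs)
    return time.time() - start

t_pure = timed_train(dm=0, dbow_words=0)             # pure PV-DBOW
t_mixed = timed_train(dm=0, dbow_words=1, window=5)  # PV-DBOW + skip-gram words

print(f"pure DBOW: {t_pure:.1f}s, with word-training: {t_mixed:.1f}s")
# Expect roughly a window-sized slowdown for the second run.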
This has the benefit of placing both the DBOW doc-vectors and the word-vectors into the "same space" - perhaps making the doc-vectors more interpretable by their closeness to words.
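For example (a sketch, assuming gensim 4.x), with dbow_words=1 you can look up the words nearest to a doc-vector, because both kinds of vectors live in the same coordinate space:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(documents1, vector_size=100, window=5, min_count=10, workers=4,
                dm=0, dbow_words=1)

# Words whose (trained) vectors lie closest to the first document's vector --
# a rough, human-readable summary of what the doc-vector encodes.
print(model.wv.similar_by_vector(model.dv[0], topn=10))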
This mixed training might serve as a sort of corpus-expansion – turning each context-window into a mini-document – that helps improve the expressiveness of the resulting doc-vector embeddings. (Though, especially with sufficiently large and diverse document sets, it may be worth comparing against pure-DBOW with more passes.)
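If you want to run that comparison, pure PV-DBOW with extra passes looks like the sketch below (the epochs value is arbitrary and gensim 4.x names are assumed):

from gensim.models.doc2vec import Doc2Vec

# Pure PV-DBOW, but with more training passes over the corpus than the default,
# as an alternative to the extra examples generated by dbow_words=1.
model = Doc2Vec(documents1, vector_size=100, min_count=10, workers=4,
                dm=0, dbow_words=0, epochs=40)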