I read this page, but I do not understand the difference between the models built by the following code. I know that when dbow_words is 0, training of doc-vectors is faster.
First model
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)
Second model
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4, dbow_words=1)
A Doc2Vec model, as opposed to a Word2Vec model, is used to create a vectorised representation of a group of words taken collectively as a single unit. It is not merely the simple average of the word-vectors in the sentence.
A size of 100 means the vector representing each document will contain 100 elements - 100 values. The vector maps the document to a point in 100-dimensional space. A size of 200 would map a document to a point in 200-dimensional space. The more dimensions, the more capacity there is to differentiate between documents.
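As a minimal sketch (assuming gensim 4.x, where the constructor parameter is vector_size rather than the older size used in the question), you can verify the dimensionality directly:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; each document gets an integer tag.
docs = [
    TaggedDocument(words=["machine", "learning", "is", "fun"], tags=[0]),
    TaggedDocument(words=["deep", "learning", "uses", "neural", "networks"], tags=[1]),
]

model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, workers=4)

print(len(model.dv[0]))  # 100 -- the first document is a point in 100-dimensional space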
Paragraph Vector (more popularly known as Doc2Vec) provides two document-embedding models: Distributed Memory (PV-DM) and Distributed Bag Of Words (PV-DBOW).
The dbow_words parameter only has an effect when training a DBOW model - that is, with the non-default dm=0 parameter.
So, between your two example lines of code, which both leave the default dm=1 value unchanged, there's no difference.
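To make that concrete, here is a sketch (assuming gensim 4.x parameter names, with documents1 being your tagged corpus as in the question) of where dbow_words does and does not matter:

from gensim.models.doc2vec import Doc2Vec

# Equivalent to both calls in the question: dm=1 (PV-DM) is the default,
# so dbow_words is ignored whether it is 0 or 1.
pv_dm = Doc2Vec(documents1, vector_size=100, window=300, min_count=10, workers=4, dm=1)

# dbow_words only takes effect once dm=0 selects PV-DBOW training:
pv_dbow = Doc2Vec(documents1, vector_size=100, min_count=10, workers=4,
                  dm=0, dbow_words=0)            # pure PV-DBOW
pv_dbow_sg = Doc2Vec(documents1, vector_size=100, window=5, min_count=10, workers=4,
                     dm=0, dbow_words=1)         # PV-DBOW plus skip-gram word-training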
If you instead switch to DBOW training, dm=0, then with the default dbow_words=0 setting, the model is pure PV-DBOW as described in the original 'Paragraph Vectors' paper. Doc-vectors are trained to be predictive of each text example's words, but no word-vectors are trained. (There'll still be some randomly-initialized word-vectors in the model, but they're not used or improved during training.) This mode is fast and still works pretty well.
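In that pure PV-DBOW mode, you would typically only consult the doc-vectors - a sketch, again assuming gensim 4.x and an iterable documents1 of TaggedDocument objects:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(documents1, vector_size=100, min_count=10, workers=4,
                dm=0, dbow_words=0)          # pure PV-DBOW

doc_vec = model.dv[0]                        # trained doc-vector for the first tag
similar_docs = model.dv.most_similar([doc_vec])  # meaningful: doc-vectors were trained

# model.wv still holds randomly-initialized word-vectors, but since they were
# never trained in this mode, similarity queries on them are not meaningful.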
If you add the dbow_words=1 setting, then skip-gram word-vector training will be added to the training, in an interleaved fashion. (For each text example, both doc-vectors over the whole text, then word-vectors over each sliding context window, will be trained.) Since this adds more training examples, as a function of the window parameter, it will be significantly slower. (For example, with window=5, adding word-training will make training about 5x slower.)
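A rough way to see this cost (a sketch, not a proper benchmark; gensim 4.x names and a documents1 corpus are assumed) is to time the two DBOW variants side by side:

import time
from gensim.models.doc2vec import Doc2Vec

def timed_train(**kwargs):
    start = time.time()
    Doc2Vec(documents1, vector_size=100, min_count=10, workers=4, **kwargs)
    return time.time() - start

t_pure = timed_train(dm=0, dbow_words=0)             # pure PV-DBOW
t_mixed = timed_train(dm=0, dbow_words=1, window=5)  # PV-DBOW + skip-gram words

print(f"pure DBOW: {t_pure:.1f}s, with word-training: {t_mixed:.1f}s")
# Expect roughly a window-sized slowdown for the second run.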
This has the benefit of placing both the DBOW doc-vectors and the word-vectors into the "same space" - perhaps making the doc-vectors more interpretable by their closeness to words.
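For example (a sketch, assuming gensim 4.x), with dbow_words=1 you can look up the words nearest to a doc-vector, because both kinds of vectors live in the same coordinate space:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(documents1, vector_size=100, window=5, min_count=10, workers=4,
                dm=0, dbow_words=1)

# Words whose (trained) vectors lie closest to the first document's vector --
# a rough, human-readable summary of what the doc-vector encodes.
print(model.wv.similar_by_vector(model.dv[0], topn=10))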
This mixed training might serve as a sort of corpus-expansion – turning each context-window into a mini-document – that helps improve the expressiveness of the resulting doc-vector embeddings. (Though, especially with sufficiently large and diverse document sets, it may be worth comparing against pure-DBOW with more passes.)
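If you want to run that comparison, pure PV-DBOW with extra passes looks like the sketch below (the epochs value is arbitrary and gensim 4.x names are assumed):

from gensim.models.doc2vec import Doc2Vec

# Pure PV-DBOW, but with more training passes over the corpus than the default,
# as an alternative to the extra examples generated by dbow_words=1.
model = Doc2Vec(documents1, vector_size=100, min_count=10, workers=4,
                dm=0, dbow_words=0, epochs=40)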