The above picture is from Distributed Representations of Sentences and Documents, the paper introducing Doc2Vec. I am using Gensim's implementation of Word2Vec and Doc2Vec, which are great, but I am looking for clarity on a few issues.
Given a trained Doc2Vec model dvm, what is dvm.docvecs? My impression is that it is the averaged or concatenated vector that includes all of the word embeddings and the paragraph vector, d. Is this correct, or is it d itself?

If dvm.docvecs is not d, can one access d by itself? How?

How is d calculated? The paper only says:

In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W.
Thanks for any leads!
The docvecs
property of the Doc2Vec model holds all trained vectors for the 'document tags' seen during training. (These are also referred to as 'doctags' in the source code.)
In the simplest case, analogous to the Paragraph Vectors paper, each text example (paragraph) just has its serial-number integer ID as its 'tag', starting at 0. That integer is then the index into the docvecs
object – and the model.docvecs.doctag_syn0
numpy array is essentially the same thing as the (capital) D in your excerpt from the Paragraph Vectors paper.
(Gensim also supports string tokens as document tags, multiple tags per document, and tags repeated across many of the training documents. Any string tags are mapped to indexes near the end of the docvecs
storage by the dict model.docvecs.doctags
.)