The above picture is from Distributed Representations of Sentences and Documents, the paper introducing Doc2Vec. I am using Gensim's implementation of Word2Vec and Doc2Vec, which are great, but I am looking for clarity on a few issues.
Given a trained Doc2Vec model dvm, what is dvm.docvecs? My impression is that it is the averaged or concatenated vector that includes all of the word embeddings and the paragraph vector, d. Is this correct, or is it d itself?

If dvm.docvecs is not d, can one access d by itself? How?

How is d calculated? The paper only says:

In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W.
Thanks for any leads!
The docvecs
property of the Doc2Vec model holds all trained vectors for the 'document tags' seen during training. (These are also referred to as 'doctags' in the source code.)
In the simplest case, analogous to the Paragraph Vectors paper, each text example (paragraph) just has its serial-number integer ID as its 'tag', starting at 0. That integer is then the index into the docvecs
object – and the model.docvecs.doctag_syn0
numpy array is essentially the same thing as the (capital) D in your excerpt from the Paragraph Vectors paper.
(Gensim also supports string tokens as document tags, multiple tags per document, and tags repeated across many of the training documents. Any string tags are mapped to indexes near the end of the docvecs
storage by the dict model.docvecs.doctags
.)