Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is gensim's 'docvecs'?

Doc2Vec Figure 2

The above picture is from Distributed Representations of Sentences and Documents, the paper introducing Doc2Vec. I am using Gensim's implementation of Word2Vec and Doc2Vec, which are great, but I am looking for clarity on a few issues.

  1. For a given doc2vec model dvm, what is dvm.docvecs? My impression is that it is the averaged or concatenated vector that includes all of the word embedding and the paragraph vector, d. Is this correct, or is it d?
  2. Supposing dvm.docvecs is not d, can one access d by itself? How?
  3. As a bonus, how is d calculated? The paper only says:

In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W.

Thanks for any leads!

like image 798
Michael Davidson Avatar asked Jan 18 '17 00:01

Michael Davidson


1 Answers

The docvecs property of the Doc2Vec model holds all trained vectors for the 'document tags' seen during training. (These are also referred to as 'doctags' in the source code.)

In the most simple case, analogous to the Paragraph Vectors paper, each text example (paragraph) just has a serial number integer ID as its 'tag', starting at 0. This would be an index into the docvecs object – and the model.docvecs.doctag_syn0 numpy array is essentially the same thing as the (capital) D in your excerpt from the Paragraph Vectors paper.

(Gensim also supports using string tokens as document tags, and multiple tags per document, and repeating tags across many of the training documents. For string tags, if any, they're mapped to indexes near the end of the docvecs by the dict model.docvecs.doctags.)

like image 123
gojomo Avatar answered Sep 22 '22 22:09

gojomo