
How to get the Document Vector from Doc2Vec in gensim 0.11.1?

Is there a way to get the document vectors of unseen and seen documents from Doc2Vec in the gensim 0.11.1 version?

  • For example, suppose I trained the model on 1000 documents. Can I get the doc vectors for those 1000 docs?

  • Is there a way to get document vectors of unseen documents composed
    from the same vocabulary?

silent_dev asked Jun 11 '16 12:06



1 Answer

For the first bullet point, you can do this in gensim 0.11.1:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence

documents = []
documents.append( LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1']) )
documents.append( LabeledSentence(words=[u'some', u'people', u'words', u'like'], labels=[u'SENT_2']) )
documents.append( LabeledSentence(words=[u'people', u'like', u'words'], labels=[u'SENT_3']) )


model = Doc2Vec(size=10, window=8, min_count=0, workers=4)
model.build_vocab(documents)
model.train(documents)

print(model[u'SENT_3'])

Here SENT_3 is the label of a sentence the model saw during training.

For the second bullet point, you cannot do this in gensim 0.11.1; you have to upgrade to 0.12.4. That version, the latest at the time of writing, has an infer_vector function that can generate a vector for an unseen document:

# Imports as in the first snippet. Here the second argument is the list of tags.
documents = []
documents.append( LabeledSentence([u'some', u'words', u'here'], [u'SENT_1']) )
documents.append( LabeledSentence([u'some', u'people', u'words', u'like'], [u'SENT_2']) )
documents.append( LabeledSentence([u'people', u'like', u'words'], [u'SENT_3']) )


model = Doc2Vec(size=10, window=8, min_count=0, workers=4)
model.build_vocab(documents)
model.train(documents)

print(model.docvecs[u'SENT_3']) # generate a vector for a known sentence
print(model.infer_vector([u'people', u'like', u'words'])) # generate a vector for an unseen sentence
Munichong answered Sep 19 '22 06:09