
Is there a limit in Gensim's Doc2Vec most_similar documents result set?

I have been experimenting with the doc2vec module for some time now. I can train my model and have the trained model return similar documents for a given document as follows:

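import re
from gensim.models.doc2vec import Doc2Vec

# Load the previously trained model
modelloaded = Doc2Vec.load("model_all_doc_dm_1")

st = 'long description of a document as string'
# Keep letters only, lowercase, and split into tokens
doc = re.sub('[^a-zA-Z]', ' ', st).lower().split()

# Infer a vector for the unseen document, then query the trained document vectors
new_doc_vec = modelloaded.infer_vector(doc)
modelloaded.docvecs.most_similar([new_doc_vec])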

This works well, and gives me 10 results. Is there a way to get more than 10 results or is that the limit?

asked Nov 18 '15 20:11 by ajaanbaahu




1 Answer

I found it:

modelloaded.docvecs.most_similar([new_doc_vec], topn=N)

Passing topn=N returns the N most similar documents. The default is topn=10, which is why only 10 results came back.
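To make the effect of topn concrete, here is a minimal pure-Python sketch of a top-N cosine-similarity lookup. It is not Gensim's implementation: the cosine and most_similar helpers and the toy doc_vectors corpus are hypothetical names made up for illustration. It only mirrors the behaviour described above, with a default of 10 results and a larger result set when topn is raised.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, doc_vectors, topn=10):
    """Return up to `topn` (tag, similarity) pairs, best match first.

    Illustrative stand-in only; it is not the Gensim method of the same name.
    """
    scored = ((tag, cosine(query, vec)) for tag, vec in doc_vectors.items())
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


# Hypothetical toy corpus: 25 two-dimensional "document vectors".
doc_vectors = {f"doc_{i}": [1.0, float(i)] for i in range(25)}
query = [1.0, 0.0]

default_hits = most_similar(query, doc_vectors)        # default topn=10
more_hits = most_similar(query, doc_vectors, topn=20)  # ask for 20 instead

print(len(default_hits), len(more_hits), more_hits[0][0])
```

Asking for more results than there are documents simply returns every document, ranked. As far as I know, in newer Gensim releases (4.x) the document vectors are exposed as model.dv rather than model.docvecs, and the topn argument works the same way there.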

answered Sep 18 '22 15:09 by ajaanbaahu