I have been experimenting with the doc2vec module for some time now. I can train my model and use it to find documents similar to a given document, as follows:
from gensim.models.doc2vec import Doc2Vec
import re

modelloaded = Doc2Vec.load("model_all_doc_dm_1")
st = 'long description of a document as string'
doc = re.sub('[^a-zA-Z]', ' ', st).lower().split()
new_doc_vec = modelloaded.infer_vector(doc)
modelloaded.docvecs.most_similar([new_doc_vec])
This works well and gives me 10 results. Is there a way to get more than 10 results, or is that the limit?
You could also try tweaking vector_size (default 100) to something smaller or larger. If your vocabulary is limited, condensing the vectors to a smaller size may give better results, since you are mapping a small vocabulary into a lower-dimensional space. The vector maps each document to a point in 100-dimensional space; a size of 200 maps a document to a point in 200-dimensional space. The more dimensions, the more differentiation between documents.
While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus. The Doc2Vec model is based on Word2Vec, with one addition: another vector (the paragraph ID) is added to the input.
I found it:

modelloaded.docvecs.most_similar([new_doc_vec], topn=N)

The topn=N parameter can be used to get more than 10 results.
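For intuition, what `most_similar(..., topn=N)` does can be sketched in pure Python: rank stored vectors by cosine similarity to a query vector and return the top N. The vectors and tags below are toy data, not real Doc2Vec output.

```python
# Toy reimplementation of most_similar with a topn parameter.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query, docvecs, topn=10):
    # Score every stored document vector, sort descending, keep topn.
    sims = [(tag, cosine(query, vec)) for tag, vec in docvecs.items()]
    return sorted(sims, key=lambda t: t[1], reverse=True)[:topn]

docvecs = {
    "doc0": [1.0, 0.0, 0.0],
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc3": [0.0, 0.0, 1.0],
}
print(most_similar([1.0, 0.05, 0.0], docvecs, topn=2))
# Returns the 2 most similar tags with their scores, e.g. doc0 then doc1.
```

So the only hard limit on the number of results is the number of document vectors in the model; topn just sets how many of the ranked candidates are returned.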