I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine best matching document for the given query and everytime I run, I get a completely different resultset. My new code iin line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Everytime I run the piece of code, I get different set of documents that are matching with this query: "only you can prevent forest fires". The difference is stark and just does not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Look into the code, in infer_vector
you are using parts of the algorithm that is non-deterministic. Initialization of word vector is deterministic - see the code of seeded_vector
, but when we look further, i.e., random sampling of words, negative sampling (updating only sample of word vector per iteration) could cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
"""Create one 'random' vector (but deterministic by seed_string)"""
# Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)
a = list('test sample')
b = list('testtesttest')
for s in (a, b):
v1 = model.infer_vector(s)
for i in range(100):
v2 = model.infer_vector(s)
assert np.all(v1 == v2), "Failed on %s" % (''.join(s))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With