Doc2Vec.infer_vector keeps giving a different result every time on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

I modified the code in line 10 to determine the best-matching document for a given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:

    inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims]
    print(rank)

Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The difference is stark, and the results just do not seem to match.

Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?

Rohan asked Jan 21 '18 00:01



2 Answers

If you look at the code, infer_vector uses parts of the algorithm that are non-deterministic. The initialization of the vector is deterministic (see the code of seeded_vector), but the later steps, i.e. random sampling of words and negative sampling (which updates only a sample of word vectors per iteration), can cause non-deterministic output (thanks @gojomo).

    def seeded_vector(self, seed_string):
        """Create one 'random' vector (but deterministic by seed_string)"""
        # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
        once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
        return (once.rand(self.vector_size) - 0.5) / self.vector_size
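The distinction made above can be illustrated with a small toy sketch. It is not gensim's code: `stable_seed`, `toy_infer` and this version of `seeded_vector` are hypothetical, `zlib.crc32` stands in for the model's hash function, and the "training" step is just random noise drawn from a shared generator. It only tries to show why a deterministic starting vector does not make the final result repeatable unless the shared generator is put back into the same state before each call.

```python
import random
import zlib


def stable_seed(text):
    """Hash a string to a 32-bit seed that is stable across Python launches."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def seeded_vector(seed_string, size):
    """Build a 'random' start vector that depends only on seed_string."""
    once = random.Random(stable_seed(seed_string))
    return [(once.random() - 0.5) / size for _ in range(size)]


def toy_infer(words, shared_rng, size=4, steps=5):
    """Toy stand-in for inference: deterministic start, randomized updates."""
    vector = seeded_vector(" ".join(words), size)
    for _ in range(steps):
        # Stand-in for random word sampling / negative sampling: each step
        # nudges one randomly chosen component by a random amount.
        index = shared_rng.randrange(size)
        vector[index] += shared_rng.uniform(-0.01, 0.01)
    return vector


words = ["only", "you", "can", "prevent", "forest", "fires"]
shared_rng = random.Random(1)

# The starting vector is identical on every call.
same_start = seeded_vector(" ".join(words), 4) == seeded_vector(" ".join(words), 4)

# Two calls that share one generator consume different random draws.
first = toy_infer(words, shared_rng)
second = toy_infer(words, shared_rng)

# Resetting the generator to the same state before each call repeats the result.
shared_rng.seed(1)
third = toy_infer(words, shared_rng)
shared_rng.seed(1)
fourth = toy_infer(words, shared_rng)

print(same_start, first == second, third == fourth)
```

In this toy the only source of variation is the shared generator, so resetting it is enough; whether the same holds for a real model depends on how that model draws its random numbers.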
l.augustyniak answered Sep 30 '22 10:09



Set negative=0 to avoid randomization:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)
a = list('test sample')
b = list('testtesttest')
for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))
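Exact equality, as asserted above, is a strict test. A softer check, sketched here under the assumption that the vectors are plain sequences of floats, measures how far repeated inferences drift apart using cosine similarity. The names `cosine_similarity` and `min_similarity_to_first` are illustrative and not part of gensim; with a real model the list would hold the results of repeated `model.infer_vector(s)` calls instead of the hand-made vectors used here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def min_similarity_to_first(vectors):
    """Lowest cosine similarity between the first vector and each later one."""
    first = vectors[0]
    return min(cosine_similarity(first, other) for other in vectors[1:])


# Hand-made vectors standing in for repeated inference results.
runs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
stable = min_similarity_to_first(runs[:2])   # identical vectors
unstable = min_similarity_to_first(runs)     # includes an orthogonal vector
print(stable, unstable)
```

A value close to 1 means the repeated results point in nearly the same direction even if they are not bit-for-bit identical.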
James answered Sep 30 '22 12:09
