
How to use infer_vector() in gensim.doc2vec?

import numpy as np
from numpy import linalg

import gensim

def cosine(vector1, vector2):
    # cosine similarity: dot(v1, v2) / (|v1| * |v2|)
    cosV12 = np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))
    return cosV12

model = gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game')
string = '民生 为了 父亲 我 要 坚强 地 ...'
tokens = string.split(' ')
vector1 = model.infer_vector(doc_words=tokens, alpha=0.1, min_alpha=0.0001, steps=5)
vector2 = model.docvecs.doctag_syn0[0]
print(cosine(vector2, vector1))

-0.0232586

I used training data to train a doc2vec model. Then I used infer_vector() to generate a vector for a document that is in the training data, but the result is different from the stored one: the cosine similarity between vector2 (saved in the doc2vec model) and vector1 (generated by infer_vector()) was only -0.0232586. That does not seem reasonable.

Update: I found my error. I should use string = u'民生 为了 父亲 我 要 坚强 地 ...' instead of string = '民生 为了 父亲 我 要 坚强 地 ...'. In Python 2 the plain literal is a byte string rather than unicode, so the split tokens did not match the model's vocabulary. With this correction, the cosine similarity rises to 0.889342.

Jeffery asked Jul 09 '17


1 Answer

As you've noticed, infer_vector() requires its doc_words argument to be a list of tokens – matching the same kind of tokenization that was used in training the model. (Passing it a string causes it to just see each individual character as an item in a tokenized list, and even if a few of the tokens are known vocabulary tokens – as with 'a' and 'I' in English – you're unlikely to get good results.)
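To see why a raw string misbehaves, note that a Python string is itself a sequence of characters, so infer_vector() would treat each character as a "word". A minimal illustration of the difference (plain Python, no gensim needed; the example sentence is just a stand-in):

```python
# A raw string iterates character by character, while a token list
# keeps whole words -- infer_vector() expects the latter.
text = "the quick brown fox"

as_chars = list(text)        # what infer_vector effectively sees if given a string
as_tokens = text.split(" ")  # the intended tokenized input

print(as_chars[:5])   # ['t', 'h', 'e', ' ', 'q']
print(as_tokens)      # ['the', 'quick', 'brown', 'fox']
```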

Additionally, the default parameters of infer_vector() may be far from optimal for many models. In particular, a larger steps (at least as large as the number of model training iterations, but perhaps even many times larger) is often helpful. Also, a smaller starting alpha, perhaps just the common default for bulk training of 0.025, may give better results.

Your test of whether inference gets a vector close to the same vector from bulk-training is a reasonable sanity-check, on both your inference parameters and the earlier training – is the model as a whole learning generalizable patterns in the data? But because most modes of Doc2Vec inherently use randomness, or (during bulk training) can be affected by the randomness introduced by multiple-thread scheduling jitter, you shouldn't expect identical results. They'll just get generally closer, the more training iterations/steps you do.
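One way to make "generally closer" concrete in such a sanity check is to compare against a similarity threshold rather than demand exact equality. A small numpy-only sketch (the toy vectors here stand in for an infer_vector() result and the corresponding bulk-trained vector; the 0.8 threshold is an illustrative assumption, not a gensim default):

```python
import numpy as np

def cosine_sim(v1, v2):
    # cosine similarity in [-1, 1]; higher means more similar
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def roughly_matches(inferred, trained, threshold=0.8):
    # Sanity check: the inferred vector should land near the bulk-trained one,
    # but randomness means we test a threshold, not exact equality.
    return cosine_sim(inferred, trained) >= threshold

# Toy stand-ins for infer_vector() output and model.docvecs[0]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 1.9, 3.2])
print(roughly_matches(a, b))  # True -- the vectors are nearly parallel
```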

Finally, note that the most_similar() method on Doc2Vec's docvecs component can also take a raw vector, to give back a list of most-similar already-known vectors. So you can try the following...

ivec = model.infer_vector(doc_words=tokens_list, steps=20, alpha=0.025)
print(model.docvecs.most_similar(positive=[ivec], topn=10))

...and get a ranked list of the top-10 most-similar (doctag, similarity_score) pairs.

gojomo answered Oct 24 '22