
How to find the closest word to a vector using word2vec

I have just started using Word2vec and I was wondering how to find the closest word to a given vector. Suppose I have this vector, which is the average of a set of vectors:

array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32) 

Is there a straightforward way to find the word in my training data that is most similar to this vector?

Or is the only solution to calculate the cosine similarity between this vector and the vector of each word in my training data, then select the closest one?

Thanks.

sel asked Sep 24 '15 11:09




2 Answers

The gensim implementation of word2vec provides a most_similar() function that lets you find words semantically close to a given word:

>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

or to its vector representation:

>>> your_word_vector = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
>>> model.most_similar(positive=[your_word_vector], topn=1)

where topn defines the desired number of returned results.

However, my gut feeling is that this function does exactly what you proposed, i.e. it calculates the cosine similarity between the given vector and every other vector in the dictionary (which is quite inefficient...)
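To make the brute-force approach from the question concrete, here is a minimal, dependency-free sketch of it. This is not gensim's actual implementation; the helper names (cosine_similarity, closest_words) and the toy three-dimensional vocabulary are made up for illustration, and real word vectors would come from your trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def closest_words(query, word_vectors, topn=1):
    """Score every word against the query vector and return the topn best."""
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in word_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vocabulary; not taken from any real model.
toy_vectors = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.3, 0.1],
    "apple": [0.0, 0.2, 0.9],
}

# Average of the "king" and "queen" vectors, mirroring the question's setup.
query = [0.85, 0.2, 0.05]

print(closest_words(query, toy_vectors, topn=2))
```

The cost is one pass over the whole vocabulary per query. With a large vocabulary you would normally keep the vectors in a single NumPy matrix and compute all similarities in one matrix-vector product instead of a Python loop.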

Nicolas Ivanov answered Sep 20 '22 19:09


Don't forget to pass an empty list for the negative words to the most_similar function:

import numpy as np

model_word_vector = np.array(my_vector, dtype='f')
topn = 20
most_similar_words = model.most_similar([model_word_vector], [], topn)

Andrew Krizhanovsky answered Sep 20 '22 19:09