I have just started using Word2vec, and I was wondering how to find the word closest to a given vector. I have this vector, which is the average of a set of word vectors:
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
Is there a straightforward way to find the word in my training data most similar to this vector?
Or is the only solution to calculate the cosine similarity between this vector and the vector of each word in my training data, then select the closest one?
Thanks.
The gensim implementation of word2vec has a most_similar() function that lets you find words semantically close to a given word:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
or to its vector representation:
>>> your_word_vector = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
>>> model.most_similar(positive=[your_word_vector], topn=1)
where topn defines the desired number of returned results.
However, my gut feeling is that this function does exactly what you proposed, i.e. it calculates the cosine similarity between the given vector and every other vector in the vocabulary (which is quite inefficient...).
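For illustration, here is a minimal NumPy sketch of that brute-force search. The vocabulary and its vectors are made-up stand-ins for a trained model's embeddings, just to show the mechanics:

```python
import numpy as np

# Toy vocabulary: word -> vector. These values are invented for
# illustration; in practice they would come from your trained model.
vocab = {
    "king":  np.array([0.9, 0.1, 0.0], dtype=np.float32),
    "queen": np.array([0.85, 0.2, 0.05], dtype=np.float32),
    "apple": np.array([0.0, 0.1, 0.95], dtype=np.float32),
}

def closest_word(query, vocab):
    """Return the vocabulary word whose vector has the highest
    cosine similarity to `query`."""
    words = list(vocab)
    matrix = np.stack([vocab[w] for w in words])  # shape (V, d)
    # Normalize the rows and the query; a dot product of unit vectors
    # is exactly the cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = matrix @ query                         # shape (V,)
    return words[int(np.argmax(sims))]

print(closest_word(np.array([0.95, 0.05, 0.0], dtype=np.float32), vocab))
# -> king
```

This is O(V) dot products per query, which is exactly the inefficiency mentioned above; for large vocabularies people typically turn to approximate nearest-neighbor indexes instead.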
Don't forget to pass an empty list of negative words to the most_similar function:
import numpy as np

model_word_vector = np.array(my_vector, dtype='f')
topn = 20
most_similar_words = model.most_similar([model_word_vector], [], topn)
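To see what that call is doing under the hood, here is a hedged NumPy sketch that returns the topn (word, similarity) pairs for a query vector, mimicking most_similar with an empty negative list. The words and vectors below are invented for illustration:

```python
import numpy as np

# Toy embeddings standing in for a trained model's vocabulary
# (values are hypothetical, chosen only to demonstrate ranking).
words = ["king", "queen", "man", "woman", "apple"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.0],
    [0.7, -0.2, 0.1],
    [0.6, 0.4, 0.1],
    [0.0, 0.1, 0.9],
], dtype=np.float32)

def most_similar_by_vector(query, topn=3):
    """Return the topn (word, cosine_similarity) pairs for `query`,
    ranked from most to least similar."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = unit @ q
    order = np.argsort(-sims)[:topn]  # indices of highest similarities first
    return [(words[i], float(sims[i])) for i in order]

print(most_similar_by_vector(np.array([0.85, 0.2, 0.0], dtype=np.float32)))
```

Note that recent gensim versions also expose a similar_by_vector() method on the keyed vectors for this lookup; check the API of the version you have installed.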