For example, we train a word2vec model using gensim:
from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Lowercase and whitespace-tokenize each document.
texts = [document.lower().split() for document in documents]

# NOTE: `size` is the pre-4.0 parameter name; gensim 4.x renamed it to `vector_size`.
w2v_model = Word2Vec(texts, size=500, window=5, min_count=1)
And when we query the similarity between words, some of the scores are negative:
>>> w2v_model.similarity('graph', 'computer')
0.046929569156789336
>>> w2v_model.similarity('graph', 'system')
0.063683518562347399
>>> w2v_model.similarity('survey', 'generation')
-0.040026775040430063
>>> w2v_model.similarity('graph', 'trees')
-0.0072684112978664561
How do we interpret the negative scores? If it's cosine similarity, shouldn't the range be [0, 1]?
What are the upper and lower bounds of the Word2Vec.similarity(x,y) function? There isn't much written in the docs: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similarity =(
Looking at the Python wrapper code, there isn't much there either: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1165
(If possible, please point me to the .pyx code where the similarity function is implemented.)
Cosine similarity is an inner product between unit-length vectors. If the angle between the two vectors is larger than 90 degrees, the cosine is negative, which means the two vectors (features) point in clearly different directions.
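As a quick sanity check, here is a minimal sketch in plain NumPy (not gensim) showing that two vectors separated by more than 90 degrees produce a negative similarity:

import numpy as np

def cosine_similarity(a, b):
    # Inner product of the two vectors after scaling each to unit length.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([-1.0, 1.0])        # 135 degrees away from a

print(cosine_similarity(a, b))   # -0.707... (angle > 90 degrees -> negative)
print(cosine_similarity(a, a))   #  1.0      (identical direction)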
Note that gensim's Similarity class indexes documents, not words: it splits the index into several smaller sub-indexes ("shards"), which are disk-based. If your entire index fits in memory (roughly one million documents per 1 GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity classes directly.
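For completeness, here is a minimal in-memory sketch of that document-similarity path, reusing the texts list from the question (the dictionary/bag-of-words step is standard gensim, and this tiny corpus easily fits in RAM):

from gensim import corpora, similarities

# Map each token to an integer id and build bag-of-words vectors.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Dense in-memory index; fine while the whole corpus fits in RAM.
index = similarities.MatrixSimilarity(corpus, num_features=len(dictionary))

# Cosine similarity of a query document against every indexed document.
query = dictionary.doc2bow("human computer interaction".lower().split())
print(list(index[query]))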
Word2Vec can therefore capture the similarity between words by training on a large corpus. The resulting similarity value is computed from the word vectors using the cosine similarity equation.
Cosine similarity can be seen as a method of normalizing document length during comparison. In information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies cannot be negative. In general, though, cosine similarity ranges from -1 to 1, just like the cosine function itself.
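A small illustration of that distinction, with made-up counts and embedding values:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-frequency vectors: every entry is >= 0, so the similarity stays in [0, 1].
doc1 = np.array([2.0, 0.0, 1.0])
doc2 = np.array([0.0, 3.0, 1.0])
print(cosine_similarity(doc1, doc2))   # ~0.14, never below 0

# Embedding vectors: entries can be negative, so the similarity spans [-1, 1].
w1 = np.array([0.4, -0.7, 0.1])
w2 = np.array([-0.5, 0.6, 0.2])
print(cosine_similarity(w1, w2))       # ~ -0.92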
As for the source:
https://github.com/RaRe-Technologies/gensim/blob/ba1ce894a5192fc493a865c535202695bb3c0424/gensim/models/word2vec.py#L1511
def similarity(self, w1, w2):
    """
    Compute cosine similarity between two words.

    Example::

      >>> trained_model.similarity('woman', 'man')
      0.73723527

      >>> trained_model.similarity('woman', 'woman')
      1.0

    """
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
As others have said, the cosine similarity can range from -1 to 1 based on the angle between the two vectors being compared. The exact implementation in gensim is a simple dot product of the normalized vectors.
https://github.com/RaRe-Technologies/gensim/blob/4f0e2ae0531d67cee8d3e06636e82298cb554b04/gensim/models/keyedvectors.py#L581
def similarity(self, w1, w2):
    """
    Compute cosine similarity between two words.

    Example::

      >>> trained_model.similarity('woman', 'man')
      0.73723527

      >>> trained_model.similarity('woman', 'woman')
      1.0

    """
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
In terms of interpretation, you can think of these values like you might think of correlation coefficients. A value of 1 is a perfect relationship between word vectors (e.g., "woman" compared with "woman"), a value of 0 represents no relationship between words, and a value of -1 represents a perfect opposite relationship between words.
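To convince yourself of this, you can replicate the built-in similarity by hand (a sketch assuming the w2v_model trained in the question; on gensim 1.0+ the word vectors live under w2v_model.wv):

import numpy as np
from gensim import matutils

v_graph = w2v_model.wv['graph']
v_trees = w2v_model.wv['trees']

# Normalize each vector to unit length, then take the dot product.
manual = np.dot(matutils.unitvec(v_graph), matutils.unitvec(v_trees))
builtin = w2v_model.wv.similarity('graph', 'trees')

print(manual, builtin)   # the two values agree up to float precision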