
Interpreting negative Word2Vec similarity from gensim

Suppose we train a word2vec model using gensim:

from gensim import corpora, models, similarities
from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

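# Lowercase each document and split it on whitespace to get token lists.
texts = [document.lower().split() for document in documents]
# Note: `size` is the pre-4.0 gensim name for this parameter (renamed to `vector_size` in gensim 4.0).
w2v_model = Word2Vec(texts, size=500, window=5, min_count=1)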

When we query the similarity between words, some of the scores are negative:

>>> w2v_model.similarity('graph', 'computer')
0.046929569156789336
>>> w2v_model.similarity('graph', 'system')
0.063683518562347399
>>> w2v_model.similarity('survey', 'generation')
-0.040026775040430063
>>> w2v_model.similarity('graph', 'trees')
-0.0072684112978664561

How do we interpret the negative scores?

If it's a cosine similarity, shouldn't the range be [0,1]?

What are the upper and lower bounds of the Word2Vec.similarity(x,y) function? There isn't much written in the docs: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similarity =(

The Python wrapper code doesn't say much either: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1165

(If possible, please point me to the .pyx code where the similarity function is implemented.)

alvas asked Feb 22 '17 03:02



2 Answers

Cosine similarity ranges from -1 to 1, same as a regular cosine wave.

[Image: cosine wave]
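
For reference, the textbook definition makes the range explicit. This is the standard formula rather than anything taken from the gensim source, and the symbols u, v and θ are notation introduced here for illustration.

```latex
\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
-1 \le \cos\theta \le 1
\quad \text{since } |u \cdot v| \le \lVert u \rVert \, \lVert v \rVert \text{ (Cauchy--Schwarz)}
```

The [0, 1] range only holds when every component of both vectors is non-negative, as with term-frequency vectors. Word2vec embeddings have components of both signs, so negative similarities are expected.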

As for the source:

https://github.com/RaRe-Technologies/gensim/blob/ba1ce894a5192fc493a865c535202695bb3c0424/gensim/models/word2vec.py#L1511

def similarity(self, w1, w2):
    """
    Compute cosine similarity between two words.
    Example::
      >>> trained_model.similarity('woman', 'man')
      0.73723527
      >>> trained_model.similarity('woman', 'woman')
      1.0
    """
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
Eugene K answered Oct 24 '22 02:10


As others have said, the cosine similarity can range from -1 to 1 based on the angle between the two vectors being compared. The exact implementation in gensim is a simple dot product of the normalized vectors.

https://github.com/RaRe-Technologies/gensim/blob/4f0e2ae0531d67cee8d3e06636e82298cb554b04/gensim/models/keyedvectors.py#L581

def similarity(self, w1, w2):
        """
        Compute cosine similarity between two words.
        Example::
          >>> trained_model.similarity('woman', 'man')
          0.73723527
          >>> trained_model.similarity('woman', 'woman')
          1.0
        """
        return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
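
To see what "a dot product of the normalized vectors" amounts to, here is a minimal pure-Python sketch. It is not the gensim code: `unit_vector` and `cosine_similarity` are hypothetical helpers written for this illustration, and plain lists stand in for the model's word vectors.

```python
import math


def unit_vector(vec):
    """Scale vec to unit length (hypothetical stand-in for a normalizing helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    ua, ub = unit_vector(a), unit_vector(b)
    return sum(x * y for x, y in zip(ua, ub))


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three pairs give 1, 0 and -1, which are the two extremes and the midpoint of the range.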

In terms of interpretation, you can think of these values much like correlation coefficients. A value of 1 means the two word vectors point in exactly the same direction (e.g., "woman" compared with "woman"), a value of 0 means the vectors are orthogonal, i.e. the model has learned no relationship between the words, and a value of -1 means the vectors point in exactly opposite directions.
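
As a rough illustration of the "no relationship" case, the sketch below compares unrelated random vectors. It rests on an assumption that is not from the thread: that vectors trained on only nine short sentences behave roughly like random ones. The `cosine` helper and the seed are hypothetical choices for this example; `dim = 500` mirrors `size=500` in the question.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 500               # assumption: same dimensionality as in the question

# Six unrelated vectors with independent Gaussian components.
vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]

# Compare consecutive pairs of unrelated vectors.
sims = [cosine(vectors[i], vectors[i + 1]) for i in range(5)]
for s in sims:
    print(round(s, 4))
```

For unrelated vectors the similarity typically has a magnitude on the order of 1/sqrt(dim) and can fall on either side of zero, so a small negative score is better read as "about zero" than as evidence of opposite meaning.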

Donovan McMurray answered Oct 24 '22 02:10