
Cosine similarity of word2vec more than 1

I used Spark's word2vec algorithm to compute document vectors for a text.

I then used the findSynonyms function of the model object to get synonyms of a few words.

I see something like this:

w2vmodel.findSynonyms('science',4).show(5)
+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity should be between 0 and 1, or at most between -1 and +1 (if negative angles are taken into account).

Why is it more than 1 here? What's going wrong?

asked Dec 29 '16 by Baktaawar

People also ask

Can cosine similarity be more than 1?

Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative.

Does Word2Vec use cosine similarity?

Word2Vec is a model used to represent words as vectors. The similarity value can then be computed by applying the cosine similarity formula to the word vectors produced by the Word2Vec model.

What does it mean if cosine similarity is 1?

The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between vectors.

How do you increase cosine similarity?

To find the cosine similarity between two documents x and y, we need to normalize them to unit length in the L2 norm. Given two normalized vectors x and y, the cosine similarity between them is simply their dot product.
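
As a small illustration of that point (the vectors below are made up for the example), applying the cosine formula directly and taking the dot product of L2-normalized vectors give the same result:

import numpy as np

x = np.array([3.0, 1.0, 0.0])
y = np.array([1.0, 2.0, 2.0])

# Cosine formula: dot product divided by the product of the L2 norms.
cos_direct = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Normalize both vectors to unit length first, then take a plain dot product.
cos_normalized = np.dot(x / np.linalg.norm(x), y / np.linalg.norm(y))

print(cos_direct, cos_normalized)  # same value, always within [-1, 1]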


1 Answer

You should normalize the word vectors that you get from word2vec; otherwise you will get unbounded dot-product or cosine-similarity values.

From Levy et al., 2015 (and, actually, most of the literature on word embeddings):

Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.

How to do normalization?

You can do something like the following.

import numpy as np

def normalize(word_vec):
    # Compute the L2 (Euclidean) norm of the vector.
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        # Avoid division by zero for an all-zero vector.
        return word_vec
    return word_vec / norm
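
A quick usage sketch (the vectors are made up for illustration): once both vectors are normalized, their dot product is their cosine similarity and always stays within [-1, 1].

v1 = normalize(np.array([1.0, 2.0, 3.0]))
v2 = normalize(np.array([2.0, 0.5, 1.0]))

# Dot product of unit-length vectors equals their cosine similarity.
print(np.dot(v1, v2))  # always between -1 and 1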

References

  • Should I do normalization to word embeddings from word2vec if I want to do semantic tasks?
  • Should I normalize word2vec's word vectors before using them?

Update: Why is the cosine similarity from word2vec greater than 1?

According to this answer, in the Spark implementation of word2vec, findSynonyms doesn't actually return cosine distances, but rather the cosine distances multiplied by the norm of the query vector.

The ordering and relative values are consistent with the true cosine distance, but the actual values are all scaled.
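
If you want the true cosine values back, a minimal sketch (assuming the DataFrame-based pyspark.ml Word2VecModel from the question, and a Spark version that still returns the scaled values) is to look up the query word's vector via getVectors() and divide the reported similarity by its norm:

import numpy as np
from pyspark.sql import functions as F

# Look up the vector of the query word in the model's vocabulary table.
query_row = w2vmodel.getVectors().filter(F.col("word") == "science").first()
query_norm = float(np.linalg.norm(query_row["vector"].toArray()))

# findSynonyms reports cosine similarity scaled by the query vector's norm,
# so dividing by that norm brings the values back into [-1, 1].
w2vmodel.findSynonyms("science", 4) \
    .withColumn("cosine_similarity", F.col("similarity") / query_norm) \
    .show()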

answered Sep 28 '22 by Wasi Ahmad