I used Spark's word2vec algorithm to compute document vectors for a text. I then used the findSynonyms
function of the model object to get synonyms for a few words.
I see something like this:
w2vmodel.findSynonyms('science',4).show(5)
+------------+------------------+
| word| similarity|
+------------+------------------+
| physics| 1.714908638833209|
| fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
| psychology| 1.458865636374223|
+------------+------------------+
I do not understand why the cosine similarity is greater than 1. Cosine similarity should lie between -1 and +1 (or between 0 and 1 when the vectors have no negative components).
Why is it more than 1 here? What's going wrong?
Cosine similarity can be seen as a method of normalizing document length during comparison. In information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies cannot be negative.
Word2Vec is a model that represents words as vectors. A similarity score can then be computed by applying the cosine similarity formula to the word vectors produced by the Word2Vec model.
The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.
To find the cosine similarity between two vectors x and y, we normalize each of them to unit length in the L2 norm. The cosine similarity is cos(x, y) = (x · y) / (||x|| ||y||); for two L2-normalized vectors it reduces to their dot product, x · y.
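As a quick illustration, here is that formula in plain NumPy (a minimal sketch with made-up vectors, not real embeddings):
import numpy as np

# Two toy vectors standing in for word embeddings.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])

# Cosine similarity: dot product divided by the product of the L2 norms.
cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_xy)  # always lies in [-1, 1]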
You should normalize the word vectors you get from word2vec; otherwise the dot products, and hence the reported "similarity" values, are unbounded.
From Levy et al., 2015 (and, actually, most of the literature on word embeddings):
Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.
How do you do the normalization?
You can do something like the following.
import numpy as np

def normalize(word_vec):
    # L2-normalize a vector; leave all-zero vectors unchanged
    # to avoid division by zero.
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm
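For example (with placeholder vectors, not real embeddings), the dot product of two normalized vectors is a proper cosine similarity:
v_science = np.array([0.2, -1.3, 0.7])  # placeholder vector
v_physics = np.array([0.1, -1.1, 0.9])  # placeholder vector
print(np.dot(normalize(v_science), normalize(v_physics)))  # bounded by 1 in absolute value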
References
Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
Update: why is the cosine similarity from word2vec greater than 1?
According to this answer, in Spark's implementation of word2vec, findSynonyms
doesn't actually return cosine similarities, but rather cosine similarities multiplied by the norm of the query vector.
The ordering and relative values are consistent with the true cosine similarity, but the actual values are all scaled by that norm.
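If you want bounded values directly, one workaround (a sketch against the pyspark.ml DataFrame API, assuming the scaling described above holds for your Spark version) is to divide the reported similarities by the norm of the query vector:
from pyspark.sql import functions as F
import numpy as np

# Look up the raw vector for the query word in the model's vector table.
query_row = w2vmodel.getVectors().filter(F.col('word') == 'science').first()
query_norm = float(np.linalg.norm(query_row['vector'].toArray()))

# Rescale the reported scores to recover true cosine similarities.
(w2vmodel.findSynonyms('science', 4)
    .withColumn('cosine', F.col('similarity') / query_norm)
    .show())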