How to use word2vec to calculate the similarity distance by giving 2 words?

Tags:

word2vec

Word2vec is an open-source tool from Google for computing distances between words. Given an input word, it outputs a list of words ranked by their similarity to it. E.g.

Input:

france

Output:

            Word       Cosine distance

            spain              0.678515
          belgium              0.665923
      netherlands              0.652428
            italy              0.633130
      switzerland              0.622323
       luxembourg              0.610033
         portugal              0.577154
           russia              0.571507
          germany              0.563291
        catalonia              0.534176

However, what I need is to calculate the similarity between two given words directly. If I give 'france' and 'spain', how can I get the score 0.678515 without reading through the whole ranked list returned for 'france'?

asked Feb 24 '14 by zhfkt

People also ask

How does Word2Vec calculate similarity?

Word2Vec captures similarity between words by training word vectors on a large corpus. The similarity value is then computed from the word vectors using the cosine similarity equation.

Which similarity metrics are often used in calculating the distance between two Embeddings?

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

How do you find the cosine similarity between two documents?

The common way to compute cosine similarity between documents is to first count the word occurrences in each document, e.g. with the CountVectorizer or TfidfVectorizer classes provided by the Scikit-Learn library, and then take the cosine similarity of the resulting vectors.
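
A minimal sketch of that approach, assuming scikit-learn is installed (the two example sentences are just made-up toy documents):

# Cosine similarity between two documents via bag-of-words counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["france borders spain", "spain borders france and portugal"]

# Turn each document into a vector of word counts
# (TfidfVectorizer works the same way).
counts = CountVectorizer().fit_transform(docs)

# cosine_similarity returns a matrix; [0, 0] is the pairwise score.
print(cosine_similarity(counts[0], counts[1])[0, 0])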


3 Answers

gensim has a Python implementation of Word2Vec which provides an in-built utility for finding similarity between two words given as input by the user. You can refer to the following:

  1. Intro: http://radimrehurek.com/gensim/models/word2vec.html
  2. Tutorial: http://radimrehurek.com/2014/02/word2vec-tutorial/

UPDATED: Gensim 4.0.0 and above

The syntax in Python for finding similarity between two words goes like this:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load('path/to/your/model')
>>> model.wv.similarity('france', 'spain')
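
If you only have pretrained vectors in the original word2vec binary format (e.g. the GoogleNews file) rather than a saved gensim model, KeyedVectors can load them directly. A minimal sketch, assuming gensim 4.x, with the file name as a placeholder:

>>> from gensim.models import KeyedVectors
>>> wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # placeholder path
>>> wv.similarity('france', 'spain')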
answered Sep 24 '22 by Satarupa Guha


As you know, word2vec represents a word as a mathematical vector. So once you have trained the model, you can obtain the vectors of the words spain and france and compute the cosine similarity between them (the normalized dot product).

An easy way to do this is to use a Python wrapper of word2vec. You can obtain a word's vector like this:

>>> model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)

To compute the cosine similarity between two words, you can do the following:

>>> import numpy
>>> cosine_similarity = numpy.dot(model['spain'], model['france']) / (
...     numpy.linalg.norm(model['spain']) * numpy.linalg.norm(model['france']))
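
An equivalent shortcut, assuming SciPy is available: scipy.spatial.distance.cosine returns the cosine distance, so the similarity is one minus it.

>>> from scipy.spatial.distance import cosine
>>> cosine_similarity = 1 - cosine(model['spain'], model['france'])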
answered Sep 22 '22 by phyrox


I just stumbled on this while looking for how to do this by modifying the original distance.c version, not by using another library like gensim.

I didn't find an answer so I did some research, and am sharing it here for others who also want to know how to do it in the original implementation.

After looking through the C source, you will find that 'bi' is an array of indexes. If you provide two words, the index for word1 will be in bi[0] and the index of word2 will be in bi[1].

The model 'M' is an array of vectors. Each word is represented as a vector with dimension 'size'.

Using these two indexes and the model of vectors, look them up and calculate the cosine distance (which here is just the dot product, since distance.c normalizes every vector to unit length when it loads the model) like this:

/* Dot product of the two unit-length word vectors (rows bi[0] and bi[1] of M). */
dist = 0;
for (a = 0; a < size; a++) {
    dist += M[a + bi[0] * size] * M[a + bi[1] * size];
}

After this completes, the value 'dist' is the cosine similarity between the two words.
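
To sanity-check the patched binary, one option is to load the same vector file with gensim and compare (a sketch; 'vectors.bin' stands for whatever file distance.c was run against):

>>> from gensim.models import KeyedVectors
>>> wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
>>> wv.similarity('france', 'spain')  # should match the 'dist' printed by the patched distance.c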

answered Sep 23 '22 by binarymax