
Gensim Word2vec : Semantic Similarity

I want to understand the difference between gensim word2vec's two similarity methods, most_similar() and most_similar_cosmul(). I know that the first uses the cosine similarity of word vectors, while the second uses the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg. How does this affect the results? Which one gives semantic similarity? For example:

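from gensim.models import Word2Vec

# gensim < 4.0 API; in gensim 4.x, `size` is `vector_size` and the query methods live on `model.wv`
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])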

Result : [('queen', 0.50882536), ...]

model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])

Result : [(u'iraq', 0.8488819003105164), ...]

bee2502 asked Jul 20 '15 19:07


1 Answer

According to the Levy and Goldberg paper, when you are solving analogies (or otherwise combining or comparing more than two word vectors), the first method (3CosAdd, eq. 3 of the paper) is more susceptible to being dominated by a single comparison than the second method (3CosMul, eq. 4 of the paper).
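
For reference, here is a sketch of the two objectives as I understand the paper's notation, for an analogy "a is to a* as b is to b*" where the candidate b* ranges over the vocabulary V. The symbol names and the small constant ε (which guards against division by zero) are written from my reading of the paper and should be checked against it. The multiplicative form also assumes the cosines have been shifted to be non-negative, e.g. via (x + 1) / 2.

```latex
\text{3CosAdd:}\quad \arg\max_{b^* \in V} \bigl( \cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*) \bigr)

\text{3CosMul:}\quad \arg\max_{b^* \in V} \frac{\cos(b^*, b)\,\cos(b^*, a^*)}{\cos(b^*, a) + \varepsilon}
```

In gensim's terms, b and a* correspond to the `positive` words and a to the `negative` word.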

For plain semantic similarity between two word vectors, this distinction doesn't apply: model.similarity() is simply the cosine similarity of the two vectors.
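
To make the pairwise case concrete, here is a minimal sketch of cosine similarity between two vectors in plain Python. The function name `cosine_similarity` and the toy vectors are purely illustrative and are not part of gensim; real word vectors would have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors (illustrative only, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score close to 1, orthogonal ones 0, and opposite ones -1. Only one comparison is involved, so there is nothing to combine additively or multiplicatively.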

For example, using the Google News vectors:

model.similarity('Mosul','England')
0.10051745730111421

model.similarity('Iraq','England')
0.14772211471143404

model.similarity('Mosul','Baghdad')
0.83855779792754492

model.similarity('Iraq','Baghdad')
0.67975755642668911

Here Iraq is closer to England than Mosul is (both being countries), but both similarity values are small (~0.1).

On the other hand, Mosul is more similar to Baghdad than Iraq is (geographical/cultural aspects), and these similarity values are much larger (~0.7 to 0.8).

Now, for the analogy (England - London + Baghdad = X):

3CosAdd, being a linear sum, allows one large similarity term to dominate the expression. It ignores that each term reflects a different aspect of similarity, and that the different aspects have different scales.

3CosMul, on the other hand, amplifies the differences between small quantities and reduces the differences between larger ones.
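
The effect can be illustrated with a small sketch that scores two candidates both ways. Everything here is an assumption made for illustration: the function names `cos_add` and `cos_mul`, the candidate labels, the `eps` value, and above all the cosine values, which are made up and do not come from the Google News model. It is not gensim's implementation, only the shape of the two objectives.

```python
from math import prod

def cos_add(pos_sims, neg_sims):
    """3CosAdd-style score: add the positive cosines, subtract the negative ones."""
    return sum(pos_sims) - sum(neg_sims)

def cos_mul(pos_sims, neg_sims, eps=1e-3):
    """3CosMul-style score: product of positives over product of negatives."""
    shift = lambda c: (c + 1) / 2  # map cosines from [-1, 1] to [0, 1]
    numerator = prod(shift(c) for c in pos_sims)
    denominator = prod(shift(c) for c in neg_sims) + eps  # eps avoids division by zero
    return numerator / denominator

# Hypothetical cosines of two candidate words against the query words.
candidates = {
    "lopsided": {"pos": [0.90, 0.10], "neg": [0.20]},  # one large term, one tiny term
    "balanced": {"pos": [0.60, 0.35], "neg": [0.20]},  # two moderate terms
}

for name, sims in candidates.items():
    add_score = cos_add(sims["pos"], sims["neg"])
    mul_score = cos_mul(sims["pos"], sims["neg"])
    print(name, round(add_score, 3), round(mul_score, 3))
```

With these made-up numbers the additive score prefers the "lopsided" candidate (0.80 vs 0.75), because its single large term carries the sum, while the multiplicative score prefers the "balanced" one (about 0.87 vs 0.90), because the tiny second term pulls the product down.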

model.most_similar(positive=['Baghdad', 'England'], negative=['London'])
(u'Mosul', 0.5630180835723877)
(u'Iraq', 0.5184929370880127)

model.most_similar_cosmul(positive=['Baghdad', 'England'], negative=['London'])
(u'Mosul', 0.8537653088569641)
(u'Iraq', 0.8507866263389587)

kampta answered Sep 30 '22 18:09