I wanted to know the difference between gensim word2vec's two similarity measures: most_similar() and most_similar_cosmul(). I know that the first one uses cosine similarity of word vectors, while the other uses the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg. How does this affect the results? Which one gives semantic similarity? For example:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])
Result : [('queen', 0.50882536), ...]
model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
Result : [(u'iraq', 0.8488819003105164), ...]
According to the Levy and Goldberg paper, if you are trying to find analogies (or otherwise combining/comparing more than two word vectors), the first method (3CosAdd, eq. 3 of the paper) is more susceptible to being dominated by a single comparison than the second method (3CosMul, eq. 4 of the paper).
For plain semantic similarity between two word vectors, this distinction doesn't apply, since there is only one comparison and nothing to combine.
Example, using the Google News vectors:
model.similarity('Mosul','England')
0.10051745730111421
model.similarity('Iraq','England')
0.14772211471143404
model.similarity('Mosul','Baghdad')
0.83855779792754492
model.similarity('Iraq','Baghdad')
0.67975755642668911
Here Iraq is closer to England than Mosul is (both being countries), but both similarity values are small (~0.1).
On the other hand, Mosul is more similar to Baghdad than Iraq is (geographical/cultural aspects), and these similarity values are much larger (~0.7 to 0.8).
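For reference, the pairwise numbers above are plain cosine similarities between two vectors. A minimal sketch of that computation in pure Python follows; the function name and the toy 2-d vectors are made up for illustration and are not real word embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical, for illustration only).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
```

Because only the angle matters, rescaling either vector leaves the score unchanged.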
Now, for the analogy (England - London + Baghdad = X):
3CosAdd, being a linear sum, allows one large similarity term to dominate the expression. It ignores that each term reflects a different aspect of similarity, and that the different aspects have different scales.
3CosMul, on the other hand, amplifies the differences between small quantities and reduces the differences between larger ones.
model.most_similar(positive=['Baghdad', 'England'], negative=['London'])
(u'Mosul', 0.5630180835723877)
(u'Iraq', 0.5184929370880127)
model.most_similar_cosmul(positive=['Baghdad', 'England'], negative=['London'])
(u'Mosul', 0.8537653088569641)
(u'Iraq', 0.8507866263389587)
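To make the difference concrete, here is a rough sketch of the two objectives applied to the pairwise similarities quoted above. Several things here are assumptions rather than values from the model: the similarity of each candidate to 'London' is not given above, so a single placeholder value (0.2) is used for both candidates; the shift of cosines into [0, 1] and the small epsilon follow the paper's 3CosMul formulation as I understand it; and the function names are made up. gensim's internal scoring may differ in detail, so the absolute numbers will not match the outputs above.

```python
def cos_add(sim_pos1, sim_pos2, sim_neg):
    """3CosAdd-style score: a linear sum of the three cosine terms."""
    return sim_pos1 + sim_pos2 - sim_neg

def cos_mul(sim_pos1, sim_pos2, sim_neg, eps=1e-3):
    """3CosMul-style score: product of positive terms over the negative term.

    Cosines are shifted from [-1, 1] into [0, 1] so the ratio stays
    non-negative; eps guards against division by zero (assumed value).
    """
    def shift(c):
        return (c + 1.0) / 2.0
    return shift(sim_pos1) * shift(sim_pos2) / (shift(sim_neg) + eps)

# Pairwise similarities taken from the Google News example above.
mosul = {"baghdad": 0.83855779792754492, "england": 0.10051745730111421}
iraq = {"baghdad": 0.67975755642668911, "england": 0.14772211471143404}

# ASSUMPTION: placeholder similarity to 'London', identical for both candidates.
sim_london = 0.2

add_mosul = cos_add(mosul["baghdad"], mosul["england"], sim_london)
add_iraq = cos_add(iraq["baghdad"], iraq["england"], sim_london)
mul_mosul = cos_mul(mosul["baghdad"], mosul["england"], sim_london)
mul_iraq = cos_mul(iraq["baghdad"], iraq["england"], sim_london)

# How far ahead Mosul is under each objective (ratio of scores).
add_ratio = add_mosul / add_iraq
mul_ratio = mul_mosul / mul_iraq
```

Under these assumptions Mosul still ranks first with both objectives, but its lead over Iraq is noticeably smaller with the multiplicative score, because the large Baghdad term no longer swamps the small England term. That is in line with the much narrower gap in the most_similar_cosmul() output above.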