python nltk returning odd result for wordnet similarity measure

I am trying to find the similarity between two words using WordNet through Python's NLTK. The two sample keywords are 'game' and 'leonardo'. First I extract all synsets of the two words, then I cross-match every pair of synsets to compute their similarity. Here is my code:

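from nltk.corpus import wordnet as wn

xx = wn.synsets("game")
yy = wn.synsets("leonardo")
# Compare every sense of "game" with every sense of "leonardo".
for x in xx:
    for y in yy:
        # In NLTK 3, name() and definition() are methods, not attributes.
        print(x.name())
        print(x.definition())
        print(y.name())
        print(y.definition())
        print(x.wup_similarity(y))
        print('\n')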

Here is the full output:

game.n.01
a contest with rules to determine a winner
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.285714285714

game.n.02
a single play of a sport or other contest
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.285714285714

game.n.03
an amusement or pastime
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.25

game.n.04
animal hunted for food or sport
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.923076923077

game.n.05
(tennis) a division of play during which one player serves
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.222222222222

game.n.06
(games) the score at a particular point or the score needed to win
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.285714285714

game.n.07
the flesh of wild animals that is used for food
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.5

plot.n.01
a secret scheme to do something (especially something underhand or illegal)
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.2

game.n.09
the game equipment needed in order to play a particular game
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.666666666667

game.n.10
your occupation or line of work
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.25

game.n.11
frivolous or trifling behavior
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.222222222222

bet_on.v.01
place a bet on
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
-1

crippled.s.01
disabled in the feet or legs
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
-1

game.s.02
willing to face danger
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
-1

But the similarity between game.n.04 and leonardo.n.01 is really odd. I think the similarity (0.923076923077) should not be so high.

game.n.04

animal hunted for food or sport

leonardo.n.01

Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)

0.923076923077

Is there any problem with my concept?

Quazi Marufur Rahman asked Jun 25 '13 11:06



1 Answer

According to the docs, the wup_similarity() method returns...

...a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

...and...

>>> from nltk.corpus import wordnet as wn
>>> game = wn.synset('game.n.04')
>>> leonardo = wn.synset('leonardo.n.01')
>>> game.lowest_common_hypernyms(leonardo)
[Synset('organism.n.01')]
>>> organism = game.lowest_common_hypernyms(leonardo)[0]
>>> game.shortest_path_distance(organism)
2
>>> leonardo.shortest_path_distance(organism)
3

...which is why it thinks they're similar, although I get...

>>> game.wup_similarity(leonardo)
0.7058823529411765

...which is different for some reason.
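
For reference, the Wu-Palmer score is usually written as 2 * depth(LCS) / (depth(s1) + depth(s2)), so a deep common ancestor that sits only a few edges above both senses yields a high score regardless of what the senses mean. The sketch below is plain Python, not NLTK's code: the function name wu_palmer is hypothetical, and the LCS depth of 6 for organism.n.01 (counting the root as depth 1) is an assumption made for illustration. Combined with the path lengths 2 and 3 from the session above, that assumption reproduces the 0.7058823529411765 reported there.

```python
def wu_palmer(lcs_depth, dist1, dist2):
    """Wu-Palmer similarity from the LCS depth and each sense's distance to the LCS."""
    depth1 = lcs_depth + dist1  # depth of the first sense, measured through the LCS
    depth2 = lcs_depth + dist2  # depth of the second sense, measured through the LCS
    return 2.0 * lcs_depth / (depth1 + depth2)

# Assumed LCS depth of 6 (hypothetical); distances 2 and 3 come from the session above.
score = wu_palmer(6, 2, 3)
print(score)
```

Under the same formula, two senses that are identical (both distances 0) score 1.0, and the score drops as either sense moves further below the shared ancestor.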


Update

I want some measurement which will show that dissimilarity('game', 'chess') is much much less than dissimilarity('game', 'leonardo')

How about something like this...

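from itertools import product

from nltk.corpus import wordnet as wn

def compare(word1, word2):
    """Return the best path similarity over all synset pairs of the two words."""
    ss1 = wn.synsets(word1)
    ss2 = wn.synsets(word2)
    # path_similarity() can return None when no path connects two synsets
    # (for example across parts of speech), so skip those pairs.
    scores = (s1.path_similarity(s2) for s1, s2 in product(ss1, ss2))
    return max((s for s in scores if s is not None), default=0.0)

for word1, word2 in (('game', 'leonardo'), ('game', 'chess')):
    print("Path similarity of %-10s and %-10s is %.2f" % (word1, word2, compare(word1, word2)))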

...which prints...

Path similarity of game       and leonardo   is 0.17
Path similarity of game       and chess      is 0.25
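
As a rough check on those numbers: path_similarity() is defined as 1 / (d + 1), where d is the number of edges on the shortest path between two synsets. The helper below is a plain-Python sketch with a hypothetical name (path_score), not NLTK code. A path of 5 edges, such as the 2 + 3 edges through organism.n.01 shown earlier, gives about 0.17, and 0.25 corresponds to a path of 3 edges. Whether those are the exact paths behind each printed value is an assumption.

```python
def path_score(edges):
    """Path similarity for a shortest path made of `edges` hypernym/hyponym links."""
    return 1.0 / (edges + 1)

# 5 edges and 3 edges are illustrative path lengths, not values read from WordNet.
for edges in (5, 3):
    print("%d edges -> %.2f" % (edges, path_score(edges)))
```

Because the score depends only on path length, it ranges from 1.0 for identical synsets down towards 0 as the path grows, which makes it easier to read as a distance than the depth-weighted Wu-Palmer score.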
Aya answered Sep 29 '22 10:09