
Python NLTK WUP Similarity Score not unity for exact same word

Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion I also compared a synset with itself. The score refuses to budge from 0.75. What is going on here?

from nltk.corpus import wordnet as wn

# Both variables hold the first synset of the same word
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)
# Compare the synset with itself
similarity = actual.wup_similarity(actual)
print(similarity)

Prophecies asked Sep 01 '15 14:09



1 Answer

This is an interesting problem.

TL;DR:

Sorry there's no short answer to this problem =(


Too long, want to read:

Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchies to get the lowest common hypernym with lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernym of a synset and itself would have to be the synset itself:

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange it returns fruit as well:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We'll have to take a look at the code for lowest_common_hypernyms(). Its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805 says:

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned, or if there are multiple such synsets at the same depth they are all returned. However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

So let's try lowest_common_hypernyms() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

Seems like that resolves the ambiguity of the tied path. But the wup_similarity() API doesn't have the use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

Note the difference: when use_min_depth==False, lowest_common_hypernyms() ranks the candidate synsets by their maximum depth, but when use_min_depth==True it ranks them by their minimum depth, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace the lowest_common_hypernyms() code:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]

# if use_min_depth==True
>>> max_depth = max(x.min_depth() for x in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>> 
# if use_min_depth==False
>>> max_depth = max(x.max_depth() for x in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
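
The same tie can be reproduced without WordNet installed. The following is only a sketch on a toy graph: the edges in HYPERNYMS are an assumption, hand-written so that the depths match the values printed above (8 and 11), and the helper names are hypothetical, not NLTK functions.

```python
# Toy hypernym graph (child -> parents). ASSUMPTION: these edges are
# hand-written to match the depths in the trace; they are not read from WordNet.
HYPERNYMS = {
    "orange": ["citrus"],
    "citrus": ["edible_fruit"],
    "edible_fruit": ["fruit", "produce"],   # two parents: this causes the tie
    "fruit": ["reproductive_structure"],
    "reproductive_structure": ["plant_organ"],
    "plant_organ": ["plant_part"],
    "plant_part": ["natural_object"],
    "natural_object": ["whole"],
    "whole": ["object"],
    "object": ["physical_entity"],
    "produce": ["food"],
    "food": ["solid"],
    "solid": ["matter"],
    "matter": ["physical_entity"],
    "physical_entity": ["entity"],
    "entity": [],
}


def min_depth(node):
    """Length in edges of the shortest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length in edges of the longest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def ancestors(node):
    """Return node plus every hypernym reachable from it."""
    seen = {node}
    for parent in HYPERNYMS[node]:
        seen |= ancestors(parent)
    return seen


def lowest_common(a, b, use_min_depth):
    """Deepest shared hypernyms, ranked by min or max depth."""
    depth = min_depth if use_min_depth else max_depth
    common = ancestors(a) & ancestors(b)
    deepest = max(depth(s) for s in common)
    return sorted(s for s in common if depth(s) == deepest)


print(lowest_common("orange", "orange", use_min_depth=True))   # ['fruit', 'orange']
print(lowest_common("orange", "orange", use_min_depth=False))  # ['orange']
```

In this toy graph orange reaches the root in 8 edges through produce, which is exactly as deep as fruit, so ranking by minimum depth cannot tell the two apart.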

This weird phenomenon with wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

The first subsumer in the list is then selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of tied lowest common hypernyms.
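
To see where 0.75 itself comes from, here is a rough sketch of the arithmetic. It assumes wup_similarity() scores 2 * depth(subsumer) divided by the sum of the two synset depths measured through that subsumer, with depths counted in nodes. The function name wup_from_depths is hypothetical, and the values 8 (maximum depth of fruit) and 3 (edges from orange up to fruit) are assumptions that happen to be consistent with the observed score.

```python
def wup_from_depths(subsumer_max_depth, dist1, dist2):
    """Wu-Palmer score from the subsumer's depth and the two path lengths.

    Depths are counted in nodes (hence the +1), distances in edges.
    """
    depth = subsumer_max_depth + 1
    return 2.0 * depth / ((dist1 + depth) + (dist2 + depth))


# Subsumer = fruit: ASSUMED max depth 8, and three edges below it
# (orange -> citrus -> edible_fruit -> fruit) on both sides.
print(wup_from_depths(8, 3, 3))    # 0.75

# Subsumer = orange itself: max depth 11 (from the trace), distance 0.
print(wup_from_depths(11, 0, 0))   # 1.0
```

Whenever the chosen subsumer is the synset itself, both distances are zero and the score is 1 regardless of depth.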

To conclude, the default parameter is more of a feature than a bug: it is kept to reproduce the behavior of NLTK v2.x.

So one solution might be to manually change the NLTK source to force use_min_depth=False:

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
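
If patching the installed library is not an option, the score could probably be recomputed from outside instead. This is a minimal sketch, not NLTK's implementation: wup_similarity_max_depth is a hypothetical helper, it assumes the scoring formula is the depth-based one described by the code above, and it skips the simulate_root handling, so it is only meant for synsets that already share a root (such as nouns).

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer score with the subsumer picked using use_min_depth=False.

    Hypothetical helper, not part of NLTK. Expects NLTK Synset objects and
    only calls lowest_common_hypernyms(), max_depth() and
    shortest_path_distance() on them.
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None  # no shared hypernym, nothing to score
    subsumer = subsumers[0]
    depth = subsumer.max_depth() + 1  # depth counted in nodes
    len1 = synset1.shortest_path_distance(subsumer)
    len2 = synset2.shortest_path_distance(subsumer)
    if len1 is None or len2 is None:
        return None
    return 2.0 * depth / (len1 + len2 + 2 * depth)


# Usage (needs the WordNet corpus to be downloaded):
# from nltk.corpus import wordnet as wn
# x = wn.synsets('orange')[0]
# wup_similarity_max_depth(x, x)
```

Given the outputs shown earlier, the orange synset should be its own subsumer here, which would make the score 1.0, but other synset pairs may then score differently from the stock wup_similarity().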


EDITED

To resolve the problem, you can possibly add an ad-hoc check for identical synsets:

def wup_similarity_hacked(synset1, synset2):
    # Identical synsets are maximally similar by definition
    if synset1 == synset2:
        return 1.0
    return synset1.wup_similarity(synset2)

alvas answered Nov 05 '22 18:11