Python NLTK WUP Similarity Score not unity for exact same word

Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, the two words are exactly the same. To avoid any confusion, I also compared a synset with itself. The score refuses to budge from 0.75. What is going on here?

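from nltk.corpus import wordnet as wn

# Take the first synset of 'orange' twice, so both sides are identical
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]

# Compare two separately fetched (but equal) synsets
similarity = actual.wup_similarity(predicted)
print(similarity)

# Compare the synset with itself
similarity = actual.wup_similarity(actual)
print(similarity)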

asked Sep 01 '15 14:09

Prophecies

1 Answer

This is an interesting problem.

TL;DR:

Sorry there's no short answer to this problem =(

Too long, want to read:

Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchy to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernym of a synset and itself should be the synset itself:

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange it gives fruit too:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We'll have to take a look at the code for lowest_common_hypernyms(), starting with its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned or if there are multiple such synsets at the same depth they are all returned.

However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

So let's try lowest_common_hypernyms() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

That seems to resolve the ambiguity of the tied paths. But the wup_similarity() API doesn't have a use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

Note the difference: with use_min_depth=False, lowest_common_hypernyms() ranks the candidate synsets by their maximum depth, whereas with use_min_depth=True it ranks them by their minimum depth; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace the lowest_common_hypernyms() code by hand:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]

# if use_min_depth == True
>>> max_depth = max(s.min_depth() for s in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]

# if use_min_depth == False
>>> max_depth = max(s.max_depth() for s in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
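The tie can be reproduced without WordNet installed. The following is only a toy sketch: the node names come from the common_hypernyms() output above, but the edges between them are an assumed reconstruction chosen to illustrate the effect, and HYPERNYMS, min_depth, max_depth and deepest are hypothetical helpers, not NLTK functions. The point is that a node reachable by both a short and a long path has a small minimum depth, so an ancestor that sits on the long path can tie with it.

```python
# Toy hypernym graph: child -> list of direct hypernyms (parents).
# ASSUMPTION: these edges are illustrative, not read from WordNet.
HYPERNYMS = {
    "entity": [],
    "physical_entity": ["entity"],
    # long branch, ending in 'fruit'
    "object": ["physical_entity"],
    "whole": ["object"],
    "natural_object": ["whole"],
    "plant_part": ["natural_object"],
    "plant_organ": ["plant_part"],
    "reproductive_structure": ["plant_organ"],
    "fruit": ["reproductive_structure"],
    # short branch, ending in 'produce'
    "matter": ["physical_entity"],
    "solid": ["matter"],
    "food": ["solid"],
    "produce": ["food"],
    # 'edible_fruit' has two parents, so everything below it has two paths
    "edible_fruit": ["produce", "fruit"],
    "citrus": ["edible_fruit"],
    "orange": ["citrus"],
}


def min_depth(node):
    """Length of the shortest hypernym path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest hypernym path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def deepest(candidates, depth_fn):
    """Return, sorted by name, the candidates with the largest depth."""
    best = max(depth_fn(c) for c in candidates)
    return sorted(c for c in candidates if depth_fn(c) == best)


# Every node here is a common hypernym of 'orange' and itself.
candidates = list(HYPERNYMS)
print(deepest(candidates, min_depth))  # ['fruit', 'orange'] (tied at 8)
print(deepest(candidates, max_depth))  # ['orange'] (alone at 11)
```

In this toy graph the minimum depth of orange is 8 via the short branch, the same as fruit, while its maximum depth is 11 via the long branch, which matches the two max_depth values in the session above.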

This odd behaviour of wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

The first subsumer in the list is then selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of tied lowest common hypernyms.
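The choice of subsumer is enough to explain the 0.75. The sketch below assumes that wup_similarity() scores a pair as 2 * depth / (len1 + len2), where depth is the subsumer's max_depth() plus one and each len is the shortest path distance from a synset to the subsumer plus that depth. It also assumes that orange.n.01 is 3 edges away from fruit.n.01 (orange, citrus, edible_fruit, fruit). Both are assumptions not shown in the traces above, and wup_from_subsumer is a hypothetical helper, not an NLTK function.

```python
def wup_from_subsumer(subsumer_max_depth, dist1, dist2):
    """Wu-Palmer score given the chosen subsumer's max depth and the
    shortest path distances from each synset to that subsumer.

    ASSUMPTION: mirrors the arithmetic described in the lead-in;
    it does not call NLTK.
    """
    depth = subsumer_max_depth + 1
    len1 = dist1 + depth
    len2 = dist2 + depth
    return (2.0 * depth) / (len1 + len2)


# Subsumer fruit.n.01: max depth 8, assumed distance 3 from orange.n.01
print(wup_from_subsumer(8, 3, 3))    # 0.75

# Subsumer orange.n.01 itself: max depth 11, distance 0
print(wup_from_subsumer(11, 0, 0))   # 1.0
```

Under these assumptions, picking fruit gives 18 / 24 = 0.75, while picking orange itself gives 24 / 24 = 1.0.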

To conclude, the default parameter is more of a feature than a bug: it preserves reproducibility with NLTK v2.x.

So one solution might be to manually change the NLTK source to force use_min_depth=False:

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
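If editing the installed source is not an option, a rough alternative is to recompute the score outside NLTK with use_min_depth=False. This is a minimal sketch, not NLTK's implementation: wup_similarity_max_depth is a hypothetical name, it relies only on the Synset methods lowest_common_hypernyms(), max_depth() and shortest_path_distance(), it assumes the scoring formula 2 * depth / (len1 + len2) with depth = max_depth() + 1, and it ignores the simulate_root handling that the real method performs.

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer style score using max-depth subsumer selection.

    ASSUMPTION: a simplified re-implementation for illustration only;
    it skips the simulate_root logic of NLTK's wup_similarity().
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None  # no common hypernym, so no score

    subsumer = subsumers[0]
    depth = subsumer.max_depth() + 1

    len1 = synset1.shortest_path_distance(subsumer)
    len2 = synset2.shortest_path_distance(subsumer)
    if len1 is None or len2 is None:
        return None  # no path to the subsumer

    return (2.0 * depth) / ((len1 + depth) + (len2 + depth))
```

With x = wn.synsets('orange')[0], calling wup_similarity_max_depth(x, x) should give 1.0, since lowest_common_hypernyms(x, use_min_depth=False) returned only orange.n.01 in the session above. Scores for other pairs may differ from what wup_similarity() reports, so this is best treated as an experiment rather than a drop-in replacement.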

EDITED

Alternatively, to work around the problem, you can do an ad-hoc check for identical synsets:

def wup_similarity_hacked(synset1, synset2):
    # Identical synsets are maximally similar by definition
    if synset1 == synset2:
        return 1.0
    # Otherwise fall back to NLTK's own computation
    return synset1.wup_similarity(synset2)

answered Nov 05 '22 18:11

alvas

Tags: python, nlp, similarity, nltk
