I am using the WordNet API from NLTK.
When I compare one synset with another I get None,
but when I compare them the other way around I get a float value.
Shouldn't they give the same value? Is there an explanation, or is this a bug in WordNet?
Example:
wn.synset('car.n.01').path_similarity(wn.synset('automobile.v.01')) # None
wn.synset('automobile.v.01').path_similarity(wn.synset('car.n.01')) # 0.06666666666666667
WordNet is a lexical database of English created at Princeton University; NLTK exposes it through a corpus reader.
It groups nouns, verbs, adjectives and adverbs into sets of cognitive synonyms called synsets, each expressing a single concept. NLTK's Synset objects are the interface for looking these up; some words belong to only one synset and others to several.
To use WordNet from Python, first install the NLTK module, then download the WordNet data package.
Technically, without the dummy root, the car and automobile synsets would have no link to each other:
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synset('car.n.01')
>>> y = wn.synset('automobile.v.01')
>>> print(x.shortest_path_distance(y))
None
>>> print(y.shortest_path_distance(x))
None
Now, let's look at the dummy root issue closely. Firstly, there is a neat function in NLTK that says whether a synset needs a dummy root:
>>> x._needs_root()
False
>>> y._needs_root()
True
Next, when you look at the path_similarity code (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#Synset.path_similarity), you can see:
def path_similarity(self, other, verbose=False, simulate_root=True):
    distance = self.shortest_path_distance(other, \
        simulate_root=simulate_root and self._needs_root())
    if distance is None or distance < 0:
        return None
    return 1.0 / (distance + 1)
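As a quick check of the last line of that method: the float from the question corresponds to a path distance of 14, since 1 / (14 + 1) = 0.0666... The helper below is a hypothetical stand-alone copy of just the scoring step (the name score_from_distance is made up for illustration and is not part of NLTK):

```python
def score_from_distance(distance):
    """Hypothetical stand-alone version of the scoring step in path_similarity."""
    # No connecting path (None) or an invalid distance gives no score.
    if distance is None or distance < 0:
        return None
    return 1.0 / (distance + 1)


print(score_from_distance(None))  # None
print(score_from_distance(0))     # 1.0 (a synset compared with itself)
print(score_from_distance(14))    # 0.06666666666666667
```

So whether you get None or a float depends entirely on whether shortest_path_distance() finds any connecting path.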
So for the automobile synset, the argument simulate_root=simulate_root and self._needs_root() is always True (with the default simulate_root=True) when you call y.path_similarity(x), and it is always False when you call x.path_similarity(y), since x._needs_root() is False:
>>> True and y._needs_root()
True
>>> True and x._needs_root()
False
Now, when path_similarity() passes the call down to shortest_path_distance() (https://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#Synset.shortest_path_distance) and then to hypernym_distances(), it collects the hypernyms of each synset together with their distances. Without simulate_root=True, the automobile synset has no hypernym in common with the car synset, and vice versa:
>>> y.hypernym_distances(simulate_root=True)
set([(Synset('automobile.v.01'), 0), (Synset('*ROOT*'), 2), (Synset('travel.v.01'), 1)])
>>> y.hypernym_distances()
set([(Synset('automobile.v.01'), 0), (Synset('travel.v.01'), 1)])
>>> x.hypernym_distances()
set([(Synset('object.n.01'), 8), (Synset('self-propelled_vehicle.n.01'), 2), (Synset('whole.n.02'), 8), (Synset('artifact.n.01'), 7), (Synset('physical_entity.n.01'), 10), (Synset('entity.n.01'), 11), (Synset('object.n.01'), 9), (Synset('instrumentality.n.03'), 5), (Synset('motor_vehicle.n.01'), 1), (Synset('vehicle.n.01'), 4), (Synset('entity.n.01'), 10), (Synset('physical_entity.n.01'), 9), (Synset('whole.n.02'), 7), (Synset('conveyance.n.03'), 5), (Synset('wheeled_vehicle.n.01'), 3), (Synset('artifact.n.01'), 6), (Synset('car.n.01'), 0), (Synset('container.n.01'), 4), (Synset('instrumentality.n.03'), 6)])
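To make the mechanism concrete, here is a minimal toy model of it in plain Python. It is only a sketch under stated assumptions: the five-node graph, the node names and every function here are hypothetical stand-ins that imitate the behaviour described above, not NLTK's actual implementation.

```python
# Toy hypernym graph (hypothetical, far smaller than WordNet): child -> parents.
HYPERNYMS = {
    "car.n": ["motor_vehicle.n"],
    "motor_vehicle.n": ["entity.n"],
    "entity.n": [],
    "automobile.v": ["travel.v"],
    "travel.v": [],
}
ROOT = "*ROOT*"


def needs_root(node):
    # Assumption mirroring the outputs above: nouns already share a single
    # top node, verbs do not, so only verbs need the dummy root.
    return not node.endswith(".n")


def hypernym_distances(node, simulate_root=False):
    """Map the node and each of its ancestors to the shortest distance."""
    distances = {node: 0}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent in HYPERNYMS[current]:
            candidate = distances[current] + 1
            if candidate < distances.get(parent, float("inf")):
                distances[parent] = candidate
                frontier.append(parent)
    if simulate_root:
        # The dummy root sits one step above the farthest ancestor.
        distances[ROOT] = max(distances.values()) + 1
    return distances


def shortest_path_distance(a, b, simulate_root=False):
    da = hypernym_distances(a, simulate_root)
    db = hypernym_distances(b, simulate_root)
    common = set(da) & set(db)
    if not common:
        return None  # no shared ancestor, so no path
    return min(da[n] + db[n] for n in common)


def path_similarity(a, b, simulate_root=True):
    # Only the first argument decides whether the dummy root is used.
    distance = shortest_path_distance(
        a, b, simulate_root=simulate_root and needs_root(a)
    )
    if distance is None or distance < 0:
        return None
    return 1.0 / (distance + 1)


print(path_similarity("car.n", "automobile.v"))   # None
print(path_similarity("automobile.v", "car.n"))   # 0.16666666666666666
```

In the toy graph the only node the two chains can share is *ROOT*, and it is only added when the first argument is the verb, which reproduces the asymmetry from the question (with a different number, since the toy graph is much shallower than WordNet).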
So theoretically, the right path_similarity is 0 / None, but because of the simulate_root=simulate_root and self._needs_root() argument, nltk.corpus.wordnet.path_similarity() in NLTK's API is not commutative.
However, the code is not exactly wrong or buggy either: any distance measured through the dummy *ROOT* will always be large, because the position of *ROOT* never changes. So the best practice is to compute path_similarity in both directions and combine the results:
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synset('car.n.01')
>>> y = wn.synset('automobile.v.01')
>>> scores = [wn.path_similarity(x, y), wn.path_similarity(y, x)]
# When you NEVER want a None value: going through the *ROOT*
# always gets you some sort of distance from synset x to synset y.
# (Python 3 cannot compare None with a float, so filter None out first.)
>>> max((s for s in scores if s is not None), default=None)
# When you can allow None in synset similarity comparison
>>> None if None in scores else min(scores)
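If you need this in more than one place, the two-direction pattern can be wrapped in a small helper. The sketch below is hypothetical (combine_similarity and fake_similarity are names made up for illustration, not NLTK functions); it accepts any callable that returns a float or None, so wn.path_similarity could be passed in directly, and a stand-in function is used here so the example runs without the WordNet data.

```python
def combine_similarity(similarity, a, b, allow_none=False):
    """Combine both argument orders of an asymmetric similarity function.

    `similarity` is any callable taking two items and returning a float
    or None. This helper is hypothetical and not part of NLTK.
    """
    scores = [similarity(a, b), similarity(b, a)]
    if allow_none:
        # Strict reading: no score unless both directions produce one.
        return None if None in scores else min(scores)
    # Lenient reading: best available score, None only if both are None.
    return max((s for s in scores if s is not None), default=None)


def fake_similarity(a, b):
    # Stand-in for the asymmetric behaviour shown in the question.
    return 1.0 / 15 if (a, b) == ("automobile.v.01", "car.n.01") else None


print(combine_similarity(fake_similarity, "car.n.01", "automobile.v.01"))
# 0.06666666666666667
print(combine_similarity(fake_similarity, "car.n.01", "automobile.v.01", allow_none=True))
# None
```

Either way, the result no longer depends on which synset you pass first.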