
Is wordnet path similarity commutative?

I am using the WordNet API from NLTK. When I compare one synset with another I get None, but when I compare them the other way around I get a float value.

Shouldn't they give the same value? Is there an explanation, or is this a bug in WordNet?

Example:

wn.synset('car.n.01').path_similarity(wn.synset('automobile.v.01')) # None
wn.synset('automobile.v.01').path_similarity(wn.synset('car.n.01')) # 0.06666666666666667
asked Nov 19 '13 15:11 by etzourid




1 Answer

Technically, without the dummy root, the car and automobile synsets have no link to each other:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synset('car.n.01')
>>> y = wn.synset('automobile.v.01')
>>> print(x.shortest_path_distance(y))
None
>>> print(y.shortest_path_distance(x))
None

Now, let's look at the dummy root issue closely. Firstly, there is a neat function in NLTK that says whether a synset needs a dummy root:

>>> x._needs_root()
False
>>> y._needs_root()
True

Next, when you look at the path_similarity code (http://nltk.googlecode.com/svn-/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#Synset.path_similarity), you can see:

def path_similarity(self, other, verbose=False, simulate_root=True):
  distance = self.shortest_path_distance(other, \
               simulate_root=simulate_root and self._needs_root())

  if distance is None or distance < 0:
    return None
  return 1.0 / (distance + 1)

So for the automobile synset, the argument simulate_root=simulate_root and self._needs_root() is always True when you call y.path_similarity(x) (with the default simulate_root=True), and always False when you call x.path_similarity(y), since x._needs_root() is False:

>>> True and y._needs_root()
True
>>> True and x._needs_root()
False
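
As a quick check on the return line of the quoted path_similarity source, the formula 1.0 / (distance + 1) can be inverted to see what distance the score from the question implies. The helper name below is hypothetical and only illustrates the arithmetic; it is not part of NLTK.

```python
def distance_from_similarity(similarity):
    """Invert similarity = 1.0 / (distance + 1).

    Hypothetical helper for illustration; not an NLTK function.
    """
    if similarity is None:
        return None  # no path was found, so there is no distance to recover
    return int(round(1.0 / similarity - 1))


# The score reported in the question for y.path_similarity(x)
observed = 0.06666666666666667
print(distance_from_similarity(observed))  # prints 14
```

Under that formula, the float in the question corresponds to a path of 14 edges between the two synsets.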

Now, when path_similarity() passes down to shortest_path_distance() (https://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#Synset.shortest_path_distance) and then to hypernym_distances(), it collects a list of hypernyms to check their distances. Without simulate_root=True, the automobile synset will not connect to the car synset, and vice versa:

>>> y.hypernym_distances(simulate_root=True)
set([(Synset('automobile.v.01'), 0), (Synset('*ROOT*'), 2), (Synset('travel.v.01'), 1)])
>>> y.hypernym_distances()
set([(Synset('automobile.v.01'), 0), (Synset('travel.v.01'), 1)])
>>> x.hypernym_distances()
set([(Synset('object.n.01'), 8), (Synset('self-propelled_vehicle.n.01'), 2), (Synset('whole.n.02'), 8), (Synset('artifact.n.01'), 7), (Synset('physical_entity.n.01'), 10), (Synset('entity.n.01'), 11), (Synset('object.n.01'), 9), (Synset('instrumentality.n.03'), 5), (Synset('motor_vehicle.n.01'), 1), (Synset('vehicle.n.01'), 4), (Synset('entity.n.01'), 10), (Synset('physical_entity.n.01'), 9), (Synset('whole.n.02'), 7), (Synset('conveyance.n.03'), 5), (Synset('wheeled_vehicle.n.01'), 3), (Synset('artifact.n.01'), 6), (Synset('car.n.01'), 0), (Synset('container.n.01'), 4), (Synset('instrumentality.n.03'), 6)])

So theoretically, the right path_similarity is 0 / None, but because of the simulate_root=simulate_root and self._needs_root() argument, nltk.corpus.wordnet.path_similarity() in NLTK's API is not commutative.
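
The asymmetry can be reproduced without WordNet data. The following is a minimal sketch, assuming a made-up two-branch hierarchy: ToySynset, its attributes, and the node names are all hypothetical stand-ins, and only the flag expression and the return formula are copied from the source quoted above. Placing *ROOT* one step above the deepest ancestor is also an assumption, modeled on the hypernym_distances output shown earlier.

```python
class ToySynset:
    """Hypothetical stand-in for an NLTK Synset; not the library's class."""

    def __init__(self, name, hypernyms=(), needs_root=False):
        self.name = name
        self.hypernyms = list(hypernyms)
        self.needs_root = needs_root

    def ancestor_distances(self, simulate_root=False):
        # Breadth-first walk up the hierarchy, keeping the shortest distance.
        found = {self.name: 0}
        frontier = [self]
        while frontier:
            node = frontier.pop(0)
            for parent in node.hypernyms:
                if parent.name not in found:
                    found[parent.name] = found[node.name] + 1
                    frontier.append(parent)
        if simulate_root:
            # Assumption: the fake root sits one step above the deepest ancestor.
            found["*ROOT*"] = max(found.values()) + 1
        return found

    def shortest_path_distance(self, other, simulate_root=False):
        mine = self.ancestor_distances(simulate_root)
        theirs = other.ancestor_distances(simulate_root)
        shared = set(mine) & set(theirs)
        if not shared:
            return None  # the two branches never meet
        return min(mine[name] + theirs[name] for name in shared)

    def path_similarity(self, other, simulate_root=True):
        # Same flag expression as the quoted source: only self is consulted.
        distance = self.shortest_path_distance(
            other, simulate_root=simulate_root and self.needs_root)
        if distance is None or distance < 0:
            return None
        return 1.0 / (distance + 1)


# Toy noun branch (single root, so no dummy root needed) and toy verb branch.
entity = ToySynset("entity.n")
car = ToySynset("car.n", [entity])
travel = ToySynset("travel.v", needs_root=True)
automobile = ToySynset("automobile.v", [travel], needs_root=True)

print(car.path_similarity(automobile))   # prints None
print(automobile.path_similarity(car))   # prints 0.2
```

In this toy model the only thing that differs between the two calls is whose needs_root flag is read, which is enough to turn a float into None.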

However, the code is not exactly wrong or bugged either: any distance computed by going through the dummy *ROOT* will be consistently large, because the position of *ROOT* never changes. So the best practice is to calculate path_similarity in both directions and combine the results:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synset('car.n.01')
>>> y = wn.synset('automobile.v.01')

# When you NEVER want None: going through the *ROOT* always
# gives some sort of distance from synset x to synset y.
# (Relies on Python 2 ordering None below every number;
# Python 3 raises TypeError when comparing None with a float.)
>>> max(wn.path_similarity(x,y), wn.path_similarity(y,x))

# When you can allow None in the synset similarity comparison
>>> min(wn.path_similarity(x,y), wn.path_similarity(y,x))
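
On Python 3, max() and min() cannot compare None with a float, so the two directional scores have to be combined by hand. A minimal sketch, assuming you already have both results; combine_similarities and its allow_none parameter are hypothetical names, not part of NLTK:

```python
def combine_similarities(sim_xy, sim_yx, allow_none=False):
    """Combine two directional scores without comparing None to a float.

    Hypothetical helper; mirrors what max()/min() return on Python 2.
    """
    scores = [s for s in (sim_xy, sim_yx) if s is not None]
    if allow_none:
        # Like Python 2's min(): None wins if either direction is None.
        return min(scores) if len(scores) == 2 else None
    # Like Python 2's max(): any float beats None.
    return max(scores) if scores else None


# The two values reported in the question
print(combine_similarities(None, 0.06666666666666667))
print(combine_similarities(None, 0.06666666666666667, allow_none=True))
```

With NLTK loaded, the two arguments would be wn.path_similarity(x, y) and wn.path_similarity(y, x).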
answered Oct 05 '22 20:10 by alvas