
WordNet Python words similarity

I'm trying to find a reliable way to measure the semantic similarity of 2 terms. The first metric could be the path distance on a hyponym/hypernym graph (possibly a linear combination of 2-3 metrics would work better).

from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))
  • I still don't get what n.01 means and why it's necessary.
  • Is there a way to visually show the computed path between 2 terms?
  • Which other NLTK semantic metrics could I use?
alfredopacino asked Jan 22 '17 17:01


1 Answer

1. I still don't get what n.01 means and why it's necessary.

According to the NLTK documentation and source code, a synset name has the form "WORD.PART-OF-SPEECH.SENSE-NUMBER".

Quoting the source:

Create a Lemma from a "<word>.<pos>.<number>.<lemma>" string where:
<word> is the morphological stem identifying the synset
<pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
<number> is the sense number, counting from 0.
<lemma> is the morphological form of interest

Here n means noun. I also suggest reading about the WordNet dataset.

2. Is there a way to visually show the computed path between 2 terms?

Please look at the similarity section of the NLTK WordNet docs. It offers several path-based algorithms to choose from (you can also try mixing several).

A few examples from the NLTK docs:

from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print(dog.path_similarity(cat))
print(dog.lch_similarity(cat))
print(dog.wup_similarity(cat))
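
To see the chain of synsets that links two terms, rather than only a score, one option is to join their hypernym paths at the deepest shared ancestor. The sketch below is not NLTK's own implementation; the helper names (connecting_path, shortest_connecting_path, wordnet_path) and the toy paths are made up for illustration. Only wn.synset, hypernym_paths() and name() are NLTK calls, and wordnet_path assumes the WordNet corpus is installed.

```python
def connecting_path(path_a, path_b):
    """Join two root-to-node paths at their deepest shared ancestor.

    Returns the nodes from the end of path_a up to the shared ancestor
    and down to the end of path_b, or None if nothing is shared.
    """
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    if shared == 0:
        return None
    up = path_a[shared - 1:][::-1]  # a, ..., shared ancestor
    down = path_b[shared:]          # nodes below the ancestor, ..., b
    return up + down


def shortest_connecting_path(paths_a, paths_b):
    """Try every pair of root-to-node paths and keep the shortest join."""
    candidates = [connecting_path(pa, pb) for pa in paths_a for pb in paths_b]
    candidates = [c for c in candidates if c is not None]
    return min(candidates, key=len) if candidates else None


def wordnet_path(name_a, name_b):
    """Same idea on real synsets; needs NLTK with the WordNet corpus."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose
    a, b = wn.synset(name_a), wn.synset(name_b)
    paths_a = [[s.name() for s in p] for p in a.hypernym_paths()]
    paths_b = [[s.name() for s in p] for p in b.hypernym_paths()]
    return shortest_connecting_path(paths_a, paths_b)


# Toy root-to-node paths (hypothetical, not real WordNet data).
toy_dog = [["animal", "mammal", "carnivore", "canine", "dog"]]
toy_cat = [["animal", "mammal", "carnivore", "feline", "cat"]]
path = shortest_connecting_path(toy_dog, toy_cat)
print(" -> ".join(path))
```

With WordNet available, wordnet_path('dog.n.01', 'cat.n.01') would return the synset names along the route. NLTK defines path_similarity as 1 / (shortest path length + 1), so the number of nodes printed gives a quick sanity check against the score.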

For the visualization, you can build a pairwise similarity matrix M[i,j] where:

M[i,j] = word_similarity(i, j)

and use the linked Stack Overflow answer to draw the visualization.
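
A minimal sketch of building such a matrix follows. The function names, the missing parameter and the toy letter-overlap similarity are assumptions made for illustration; wordnet_similarity_matrix shows how the same helper could be fed with path_similarity, which can return None when two synsets are not connected, and it assumes NLTK with the WordNet corpus is installed.

```python
def similarity_matrix(items, similarity, missing=0.0):
    """Return M as a list of lists with M[i][j] = similarity(items[i], items[j]).

    Some measures return None when no path exists; `missing` replaces
    those values (an assumption of this sketch).
    """
    matrix = []
    for a in items:
        row = []
        for b in items:
            score = similarity(a, b)
            row.append(missing if score is None else score)
        matrix.append(row)
    return matrix


def wordnet_similarity_matrix(synset_names):
    """Build the matrix with path_similarity; needs NLTK plus WordNet."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose
    synsets = [wn.synset(name) for name in synset_names]
    return similarity_matrix(synsets, lambda a, b: a.path_similarity(b))


def toy_similarity(a, b):
    """Stand-in measure so the sketch runs without WordNet: letter overlap."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


words = ["dog", "cat", "cow"]
M = similarity_matrix(words, toy_similarity)
for word, row in zip(words, M):
    print(word, ["%.2f" % value for value in row])
```

If the drawing tool expects distances rather than similarities, 1 - M[i][j] is a common conversion for measures bounded in [0, 1].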

3. Which other nltk semantic metric could I use?

As mentioned above, there are several ways to calculate word similarity. I also suggest looking into gensim; I used its word2vec implementation for word similarity and it worked well for me.
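
Word2vec-style similarity is usually the cosine of the angle between two word vectors. The sketch below shows that computation in plain Python; the 3-dimensional vectors are invented for illustration and do not come from any trained model. With gensim, a trained model's word vectors expose a similarity method that, to my understanding, computes the same quantity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["cat"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["car"]))
```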

If you need help choosing an algorithm, please provide more information about the problem you are facing.

Update:

More info about word sense number meaning can be found here:

Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1...

The problem is that "dog" is ambiguous, so you must choose the right meaning for it.

As a naive approach you might choose the first sense, or you can devise your own algorithm for choosing the right meaning, depending on your application or research.
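
One simple alternative to always taking the first sense is to score every pair of senses and keep the best one. This is only one possible heuristic, sketched under assumptions: best_sense_pair, wordnet_best_sense_pair and the toy sense labels are hypothetical names, and the WordNet variant assumes NLTK with the corpus installed.

```python
def best_sense_pair(senses_a, senses_b, similarity):
    """Return (score, sense_a, sense_b) for the most similar pair, or None."""
    best = None
    for a in senses_a:
        for b in senses_b:
            score = similarity(a, b)
            if score is None:  # e.g. no path between the two senses
                continue
            if best is None or score > best[0]:
                best = (score, a, b)
    return best


def wordnet_best_sense_pair(word_a, word_b):
    """Same search over real synsets; needs NLTK plus the WordNet corpus."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose
    return best_sense_pair(wn.synsets(word_a), wn.synsets(word_b),
                           lambda a, b: a.path_similarity(b))


# Toy scores for made-up sense labels (not real WordNet values).
toy_scores = {
    ("dog.animal", "cat.animal"): 0.2,
    ("dog.sausage", "cat.animal"): 0.05,
}


def toy_similarity(a, b):
    return toy_scores.get((a, b))  # None when the pair is unknown


result = best_sense_pair(["dog.animal", "dog.sausage", "dog.verb"],
                         ["cat.animal"], toy_similarity)
print(result)
```

Taking the maximum tends to favour the senses that make the two words look related, which may or may not match the sense intended in your text.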

To get all available senses of a word (called synsets in the WordNet docs), you can simply call wn.synsets(word).

I encourage you to dig into the metadata that each synset carries.

The code below shows a simple example that gets this metadata and prints it nicely.

from nltk.corpus import wordnet as wn

dog_synsets = wn.synsets('dog')

for i, syn in enumerate(dog_synsets):
    print('%d. %s' % (i, syn.name()))
    print('alternative names (lemmas): "%s"' % '", "'.join(syn.lemma_names()))
    print('definition: "%s"' % syn.definition())
    if syn.examples():
        print('example usage: "%s"' % '", "'.join(syn.examples()))
    print('\n')

code output:

0. dog.n.01
alternative names (lemmas): "dog", "domestic_dog", "Canis_familiaris"
definition: "a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds"
example usage: "the dog barked all night"


1. frump.n.01
alternative names (lemmas): "frump", "dog"
definition: "a dull unattractive unpleasant girl or woman"
example usage: "she got a reputation as a frump", "she's a real dog"


2. dog.n.03
alternative names (lemmas): "dog"
definition: "informal term for a man"
example usage: "you lucky dog"


3. cad.n.01
alternative names (lemmas): "cad", "bounder", "blackguard", "dog", "hound", "heel"
definition: "someone who is morally reprehensible"
example usage: "you dirty dog"


4. frank.n.02
alternative names (lemmas): "frank", "frankfurter", "hotdog", "hot_dog", "dog", "wiener", "wienerwurst", "weenie"
definition: "a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll"


5. pawl.n.01
alternative names (lemmas): "pawl", "detent", "click", "dog"
definition: "a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward"


6. andiron.n.01
alternative names (lemmas): "andiron", "firedog", "dog", "dog-iron"
definition: "metal supports for logs in a fireplace"
example usage: "the andirons were too hot to touch"


7. chase.v.01
alternative names (lemmas): "chase", "chase_after", "trail", "tail", "tag", "give_chase", "dog", "go_after", "track"
definition: "go after with the intent to catch"
example usage: "The policeman chased the mugger down the alley", "the dog chased the rabbit"
ShmulikA answered Sep 19 '22 18:09