
Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words?

Which of these similarity functions in nltk.corpus.wordnet is appropriate for measuring how similar two words are?

    path_similarity()?
    lch_similarity()?
    wup_similarity()?
    res_similarity()?
    jcn_similarity()?
    lin_similarity()?

I want to use the function for word clustering and for the Yarowsky algorithm, to find similar collocations in a large text.

Masoud Abasian asked Sep 13 '11 10:09


2 Answers

These measures are actually defined for word senses (or concepts), not words, and that distinction might matter. For example, the word "train" can mean "locomotive" or "being taught to do something". To use these measures you'd need to know which sense was intended.

If you want to do word clustering, these measures might not be exactly what you want...
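
To make the word-versus-sense distinction concrete, here is a minimal sketch of one common workaround: score every pair of senses and keep the best one. This is an assumption for illustration, not something the answer recommends, and it ignores which sense was actually intended in context. The helper names `best_sense_similarity` and `wordnet_word_similarity` are hypothetical; the WordNet part needs NLTK with the WordNet data installed and is restricted to noun senses and `path_similarity()` purely as an example.

```python
from itertools import product


def best_sense_similarity(senses_a, senses_b, similarity):
    """Return the highest score over all sense pairs, or None if no pair has a score."""
    best = None
    for sense_a, sense_b in product(senses_a, senses_b):
        score = similarity(sense_a, sense_b)
        if score is None:
            continue  # some measures are undefined for certain sense pairs
        if best is None or score > best:
            best = score
    return best


def wordnet_word_similarity(word_a, word_b):
    """Word-level score: maximum path_similarity over all noun senses.

    Assumes NLTK is installed and the WordNet corpus has been downloaded
    (for example with nltk.download("wordnet")).
    """
    from nltk.corpus import wordnet as wn  # imported lazily so the helper above works without NLTK

    return best_sense_similarity(
        wn.synsets(word_a, pos=wn.NOUN),
        wn.synsets(word_b, pos=wn.NOUN),
        lambda a, b: a.path_similarity(b),
    )
```

Calling `wordnet_word_similarity("train", "locomotive")` would compare every noun sense of "train" with every noun sense of "locomotive". Because the maximum is taken, an unrelated sense of "train" cannot lower the score, but the result also says nothing about which sense a particular sentence used.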

Ted Pedersen answered Sep 22 '22 16:09


I've been playing with NLTK/wordnet myself for the purposes of trying to match up some texts in some automatic way. As Ted Pedersen's answer notes, it pretty quickly becomes clear that the similarity functions in nltk.corpus.wordnet only produce non-zero similarities for quite closely related terms with a solid IS-A pedigree.

What I ended up doing was taking the vocabulary in my texts and using lemma->synset->lemmas and lemma->similar_tos links to grow my own word linkage graph (graph_tool is fantastic for this). I then counted the minimum number of hops needed to link two words, which gives some sort of (dis-)similarity measure between them. (It is quite entertaining to print these out; it's like watching a very bizarre word-association game.) This actually worked well enough for my purposes, even without any attempt to take POS or sense into account.
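
A rough sketch of the hop-counting idea, using only the Python standard library for the graph instead of graph_tool. This is a reconstruction under assumptions, not the code actually used: the function names are hypothetical, the graph is a plain dict of sets, similar-to links are followed at the synset level, and each vocabulary word is expanded by a single step only.

```python
from collections import deque


def add_link(graph, word_a, word_b):
    """Add an undirected edge between two words."""
    graph.setdefault(word_a, set()).add(word_b)
    graph.setdefault(word_b, set()).add(word_a)


def hop_distance(graph, start, goal):
    """Breadth-first search: minimum number of edges between two words, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, hops = queue.popleft()
        for neighbour in graph.get(word, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def build_wordnet_graph(vocabulary):
    """Link each word to the lemmas of its synsets and of their similar_tos synsets.

    Assumes NLTK is installed with the WordNet corpus downloaded.
    Multi-word lemma names contain underscores (e.g. "railroad_train").
    """
    from nltk.corpus import wordnet as wn  # imported lazily; the search above needs no NLTK

    graph = {}
    for word in vocabulary:
        for synset in wn.synsets(word):
            for related in [synset] + synset.similar_tos():
                for name in related.lemma_names():
                    if name != word:
                        add_link(graph, word, name)
    return graph
```

With a graph built this way, `hop_distance(graph, "train", "locomotive")` returns the number of links on the shortest chain between the two words, or `None` when no chain exists. Smaller hop counts mean more closely linked words, so the value behaves as a dissimilarity rather than a similarity.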

timday answered Sep 24 '22 16:09