Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words?
path_similarity()?
lch_similarity()?
wup_similarity()?
res_similarity()?
jcn_similarity()?
lin_similarity()?
I want to use such a function for word clustering, and the Yarowsky algorithm to find similar collocations in a large text.
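For reference, all of the measures listed above are methods on Synset objects, and the last three also need an information-content dictionary. A minimal sketch of calling each one (the particular synsets and the Brown IC file are just illustrative choices):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Measures based only on the WordNet taxonomy
print(dog.path_similarity(cat))
print(dog.lch_similarity(cat))
print(dog.wup_similarity(cat))

# res/jcn/lin additionally need an information-content corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(dog.res_similarity(cat, brown_ic))
print(dog.jcn_similarity(cat, brown_ic))
print(dog.lin_similarity(cat, brown_ic))
```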
These measures are actually for word senses (or concepts), not words. That distinction might matter. In other words, the word "train" can mean "locomotive" or "being taught to do something". To use these measures you'd need to know which sense was intended.
If you want to do word clustering, these measures might not be exactly what you want...
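To illustrate the point about senses, here is a minimal sketch. The max-over-all-sense-pairs helper is a common heuristic workaround when you only have words, not something these measures were designed for:

```python
from nltk.corpus import wordnet as wn

# "train" has several senses; the similarity score depends on which one you pick
for s in wn.synsets('train'):
    print(s.name(), '-', s.definition())

def max_word_similarity(word1, word2):
    """Heuristic: take the maximum similarity over all sense pairs."""
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.wup_similarity(s2)  # or path_similarity, etc.
            if sim is not None and sim > best:
                best = sim
    return best

print(max_word_similarity('train', 'teach'))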
I've been playing with NLTK/wordnet myself for the purposes of trying to match up some texts in some automatic way. As Ted Pedersen's answer notes, it pretty quickly becomes clear that the similarity functions in nltk.corpus.wordnet
only produce non-zero similarities for quite closely related terms with a solid IS-A pedigree.
What I ended up doing was taking the vocabulary in my texts, and then using lemma->synset->lemmas and lemma->similar_tos to grow my own word linkage graph (graph_tool is fantastic for this), and then counting the minimum number of hops needed to link two words to get some sort of (dis-)similarity measure between them (quite entertaining to print these out; like watching a very bizarre word-association game). This actually worked well enough for my purposes, even without any attempt to take POS/sense into account.
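Here is a rough sketch of that word-linkage-graph idea, using networkx in place of graph_tool for brevity; the function names and the tiny vocabulary are just illustrative:

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def build_linkage_graph(vocabulary):
    """Link each word to the other lemmas of its synsets and to 'similar to' synsets."""
    g = nx.Graph()
    for word in vocabulary:
        for synset in wn.synsets(word):
            # word -> synset -> other lemmas of that synset
            for lemma in synset.lemmas():
                g.add_edge(word, lemma.name())
            # word -> "similar to" synsets (mainly adjectives)
            for similar in synset.similar_tos():
                for lemma in similar.lemmas():
                    g.add_edge(word, lemma.name())
    return g

def hop_distance(graph, word1, word2):
    """Minimum number of hops linking two words, or None if unreachable."""
    try:
        return nx.shortest_path_length(graph, word1, word2)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return None

g = build_linkage_graph(['train', 'teach', 'locomotive', 'coach'])
print(hop_distance(g, 'train', 'coach'))
```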