I came across several methods for measuring semantic similarity that use the structure and hierarchy of WordNet, e.g. the Jiang-Conrath measure (JCN), the Resnik measure (RES), and the Lin measure (LIN).
In NLTK, they are computed like this (entry1 and entry2 are WordNet synsets):
    sim2 = entry1.jcn_similarity(entry2, brown_ic)
    sim3 = entry1.res_similarity(entry2, brown_ic)
    sim4 = entry1.lin_similarity(entry2, brown_ic)
If WordNet is the basis for calculating semantic similarity, what is the Brown Corpus used for here?
Take a look at the explanation in the NLTK HOWTO for WordNet.
Specifically, the *_ic suffix stands for information content.
synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.
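To see that corpus dependence concretely, here is a minimal sketch comparing the same Resnik score under two of the IC files shipped with NLTK (the pair dog.n.01/cat.n.01 is just an illustrative choice):

    from nltk.corpus import wordnet as wn, wordnet_ic

    # Two IC files computed from different corpora
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    semcor_ic = wordnet_ic.ic('ic-semcor.dat')

    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

    # Same synset pair, same measure, but the scores differ because
    # the IC of their least common subsumer (carnivore.n.01) is
    # estimated from different corpus counts.
    print(dog.res_similarity(cat, brown_ic))
    print(dog.res_similarity(cat, semcor_ic))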
A bit more info on information content from here:
The conventional way of measuring the IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with statistics on their actual usage in text as derived from a large corpus.
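In the standard Resnik formulation, IC(c) = -log p(c), where p(c) is the probability of encountering an instance of concept c in the corpus, and every occurrence of a concept also counts toward all of its ancestors. A toy sketch with made-up counts:

    import math

    # Hypothetical toy counts, with occurrences propagated up the
    # hierarchy: every 'dog' occurrence also counts toward 'canine'
    # and 'animal' (the root).
    counts = {'animal': 100, 'canine': 30, 'dog': 20}
    root_count = counts['animal']

    def information_content(concept):
        # IC(c) = -log p(c), with p(c) = count(c) / count(root)
        return math.log(root_count / counts[concept])

    print(information_content('animal'))  # 0.0  (the root carries no information)
    print(information_content('dog'))     # ~1.61 (more specific -> higher IC)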
The brown_ic in your code refers to the information content file ~/nltk_data/corpora/wordnet_ic/ic-brown.dat. For more detail on the format of ic-brown.dat, check out this thread from the NLTK-user group.
Overall, the ic-brown.dat file lists the WordNet synsets observed in the Brown corpus together with frequency-derived counts, from which their information content values are computed.
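If you want to poke at what NLTK actually loads from that file, the sketch below assumes the nested-dict layout that nltk.corpus.wordnet_ic produces (POS letter mapped to a dict from synset offset to count):

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')

    # brown_ic maps a POS tag ('n' or 'v') to a dict from synset
    # offsets to corpus-derived counts; offset 0 accumulates the
    # total count for that POS (the root synsets).
    dog = wn.synset('dog.n.01')
    print(brown_ic['n'][dog.offset()])  # count for the dog synset
    print(brown_ic['n'][0])             # total noun count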
The semantic measures of JCN, Resnik, and Lin all require a corpus in addition to WordNet. These measures combine WordNet with corpus statistics, and they have been shown to correlate better with human judgment than measures that use WordNet alone (Li 2006; Pedersen 2010).
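To illustrate the contrast, a purely structural measure such as path_similarity needs no IC file, while lin_similarity does (again using dog.n.01/cat.n.01 only as an arbitrary example pair):

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')
    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

    # Structure only: based on the shortest hypernym-path distance
    # between the two senses in the WordNet taxonomy
    print(dog.path_similarity(cat))

    # Structure + corpus statistics: Lin's measure,
    # 2 * IC(lcs) / (IC(dog) + IC(cat))
    print(dog.lin_similarity(cat, brown_ic))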