I am trying to calculate semantic similarity between two words. I am using WordNet-based similarity measures, i.e. the Resnik measure (RES), Lin measure (LIN), Jiang and Conrath measure (JNC), and Banerjee and Pedersen measure (BNP).
To do that, I am using nltk and WordNet 3.0. Next, I want to combine the similarity values obtained from the different measures. To do that I need to normalize the similarity values, as some measures give values between 0 and 1, while others give values greater than 1.
So, my question is: how do I normalize the similarity values obtained from different measures?
Extra detail on what I am actually trying to do: I have a set of words. I calculate pairwise similarity between the words, and remove the words that are not strongly correlated with the other words in the set.
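A minimal sketch of that filtering step, assuming we already have some normalized, symmetric pairwise similarity function. The threshold and the helper toy_sim below are hypothetical stand-ins, not actual WordNet measures:

```python
def filter_weak_words(words, sim, threshold):
    """Keep only words whose average similarity to the other
    words in the set is at least `threshold`.

    `sim` is any symmetric similarity function returning a value
    in [0, 1] (e.g. a normalized WordNet measure)."""
    kept = []
    for w in words:
        others = [u for u in words if u != w]
        avg = sum(sim(w, u) for u in others) / len(others)
        if avg >= threshold:
            kept.append(w)
    return kept

# Toy stand-in similarity: character-set overlap, for illustration only.
def toy_sim(w, u):
    a, b = set(w), set(u)
    return len(a & b) / len(a | b)

print(filter_weak_words(["cat", "cart", "act", "xyz"], toy_sim, 0.4))
# -> ['cat', 'cart', 'act']
```

In practice `sim` would be one of the normalized WordNet measures (or a weighted combination of them, as discussed in the answer below).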
print wn.synset('gorgeous.a.01').wup_similarity(wn.synset('amazing.a.01')) # None (!!!)
Note that wup_similarity can return None for some pairs, as above: adjective synsets are not organized in a hypernym taxonomy, so there is no path between them. There are several issues with how WordNet computes word similarity; although the method has a number of drawbacks, it performs fairly well.
It calculates the similarity based on how similar the word senses are and where the synsets occur relative to each other in the hypernym tree. For example, hello and selling come out as roughly 27% similar, because they share common hypernyms further up the tree.
It calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the LCS (Least Common Subsumer). The score satisfies 0 < score <= 1; it can never be zero, because the depth of the LCS is never zero (the depth of the taxonomy root is one).
One of the core quantities used to calculate similarity is the shortest-path distance between the two synsets via their common hypernym. Note: when two synsets are many steps away from each other, the similarity score is very low, because they are not very similar.
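The depth-based score described above can be written out as a small sketch. wup_from_depths below is an illustrative helper, not the actual nltk implementation:

```python
def wup_from_depths(depth_a, depth_b, depth_lcs):
    """Wu-Palmer similarity from taxonomy depths:
    2 * depth(LCS) / (depth(a) + depth(b)).
    With the root at depth 1, the score is always in (0, 1]."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

# Identical synsets: the LCS is the synset itself -> score 1.0.
print(wup_from_depths(5, 5, 5))   # -> 1.0
# Distant synsets whose LCS sits near the root -> low score.
print(wup_from_depths(8, 7, 1))   # ~0.133
```

This makes the claim in the previous paragraph concrete: since depth(LCS) >= 1, the numerator is never zero, so the score is never zero.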
Let's consider a single arbitrary similarity measure M and take an arbitrary word w. Define m = M(w, w). Then m takes the maximum possible value of M.

Let's define MN as a normalized measure M. For any two words w, u you can compute MN(w, u) = M(w, u) / m.

It's easy to see that if M takes non-negative values, then MN takes values in [0, 1].
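As a sketch in Python, the normalization step might look like this. The toy measure M below is hypothetical, standing in for a raw score (such as Resnik's) that can exceed 1:

```python
def normalize(M, w):
    """Given a raw similarity measure M and a reference word w,
    return MN(a, b) = M(a, b) / M(w, w): if M is non-negative and
    M(w, w) is its maximum, MN maps into [0, 1]."""
    m = M(w, w)
    return lambda a, b: M(a, b) / m

# Toy symmetric measure with values above 1 (hypothetical raw scores).
raw = {("dog", "dog"): 8.0, ("dog", "cat"): 5.2, ("dog", "car"): 1.3}
def M(a, b):
    return raw.get((a, b), raw.get((b, a), 0.0))

MN = normalize(M, "dog")
print(MN("dog", "cat"))  # -> 0.65
```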
In order to compute your own defined measure F, combined of k different measures m_1, m_2, ..., m_k, first normalize each m_i independently using the above method, and then define weights alpha_1, alpha_2, ..., alpha_k such that alpha_i denotes the weight of the i-th measure. All alphas must sum up to 1, i.e.:

alpha_1 + alpha_2 + ... + alpha_k = 1

Then, to compute your own measure for w, u, you do:

F(w, u) = alpha_1 * m_1(w, u) + alpha_2 * m_2(w, u) + ... + alpha_k * m_k(w, u)
It's clear that F takes values in [0, 1].
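A minimal sketch of the weighted combination, with two hypothetical already-normalized measures standing in for the real ones:

```python
def combine(measures, alphas):
    """Combine k normalized measures with weights alpha_i that sum
    to 1: F(w, u) = sum_i alpha_i * m_i(w, u).
    If every m_i maps into [0, 1], so does F."""
    assert abs(sum(alphas) - 1.0) < 1e-9, "weights must sum to 1"
    def F(w, u):
        return sum(a * m(w, u) for a, m in zip(alphas, measures))
    return F

# Two toy normalized measures (stand-ins for e.g. LIN and JNC scores).
m1 = lambda w, u: 0.8
m2 = lambda w, u: 0.4
F = combine([m1, m2], [0.5, 0.5])
print(F("dog", "cat"))  # -> 0.6
```

The weights let you tune how much each measure contributes; with equal alphas, F is simply the average of the normalized scores.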