
Check the similarity between two words with NLTK in Python

I have two lists, and I want to compute the similarity between each word in the first list and each word in the second, then find the maximum similarity. Here is my code:

from nltk.corpus import wordnet

list1 = ['Compare', 'require']
list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify', 'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name', 'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise', 'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell', 'select', 'show', 'spell', 'state', 'tell', 'trace', 'write']
list = []

for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)[0]
        wordFromList2 = wordnet.synsets(word2)[0]
        s = wordFromList1.wup_similarity(wordFromList2)
        list.append(s)

print(max(list)) 

But this raises an error:

wordFromList2 = wordnet.synsets(word2)[0]
        IndexError: list index out of range

Please help me to fix this.
Thanking you

Punuth asked Jun 14 '15 12:06



2 Answers

You get this error when a synset list is empty and you try to take the element at the (non-existent) index zero. But why check only the zeroth element? If you want to check everything, try all pairs of elements in the returned synsets. You can use itertools.product() to save yourself two nested for-loops:

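from itertools import product
from nltk.corpus import wordnet

sims = []

for word1, word2 in product(list1, list2):
    syns1 = wordnet.synsets(word1)
    syns2 = wordnet.synsets(word2)
    # An empty synset list simply yields no pairs, so no IndexError.
    for sense1, sense2 in product(syns1, syns2):
        d = wordnet.wup_similarity(sense1, sense2)
        # Record the two senses compared, not the whole synset lists.
        sims.append((d, sense1, sense2))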

This is inefficient because the same synsets are looked up again and again, but it is the closest to the logic of your code. If you have enough data to make speed an issue, you can speed it up by collecting the synsets for all words in list1 and list2 once, and taking the product of the synsets. The "or 0" replaces a None score (returned when no connecting path is found between two senses) with zero, so that max() can compare the results.

>>> allsyns1 = set(ss for word in list1 for ss in wordnet.synsets(word))
>>> allsyns2 = set(ss for word in list2 for ss in wordnet.synsets(word))
>>> best = max((wordnet.wup_similarity(s1, s2) or 0, s1, s2) for s1, s2 in
...         product(allsyns1, allsyns2))
>>> print(best)
(0.9411764705882353, Synset('command.v.02'), Synset('order.v.01'))
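
The set-based version reports the best pair of synsets but not which words from the two lists they came from. If you also want the words, a minimal sketch along these lines may help. It looks up each distinct word once and keeps the best-scoring word pair. The function name best_word_pair and its parameters get_synsets and similarity are hypothetical, introduced only for illustration; the lookup and scoring functions are passed in so the example can run on toy data without WordNet installed.

```python
from itertools import product


def best_word_pair(words1, words2, get_synsets, similarity):
    """Return (score, word1, word2) for the most similar pair, or None.

    get_synsets and similarity are hypothetical parameter names: any
    callables with the shape word -> list of senses and
    (sense, sense) -> float or None will do.
    """
    # Look up each distinct word once instead of once per pair.
    senses = {w: get_synsets(w) for w in set(words1) | set(words2)}

    best = None
    for w1, w2 in product(words1, words2):
        # product() over an empty sense list yields nothing, so words
        # without synsets are skipped rather than raising IndexError.
        for s1, s2 in product(senses[w1], senses[w2]):
            score = similarity(s1, s2) or 0  # treat a None score as 0
            if best is None or score > best[0]:
                best = (score, w1, w2)
    return best


if __name__ == "__main__":
    # Toy stand-ins (assumed data, not real WordNet output).
    toy_senses = {"compare": ["c1", "c2"], "match": ["m1"], "how": []}
    toy_scores = {("c1", "m1"): 0.4, ("c2", "m1"): 0.8}
    print(best_word_pair(
        ["compare"], ["how", "match"],
        lambda w: toy_senses.get(w, []),
        lambda a, b: toy_scores.get((a, b)),
    ))
```

With NLTK available, you would presumably call it as best_word_pair(list1, list2, wordnet.synsets, wordnet.wup_similarity), using the same two functions as in the code above.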
alexis answered Oct 12 '22 22:10


Try checking whether these lists are empty before you use them:

from nltk.corpus import wordnet

list1 = ['Compare', 'require']
list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify', 'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name', 'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise', 'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell', 'select', 'show', 'spell', 'state', 'tell', 'trace', 'write']
similarities = []  # avoid shadowing the built-in name list

for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2: #Thanks to @alexis' note
            s = wordFromList1[0].wup_similarity(wordFromList2[0])
            if s is not None:  # wup_similarity can return None
                similarities.append(s)

print(max(similarities))
omerbp answered Oct 12 '22 22:10