I'm trying to create a general synonym identifier for the words in a sentence that are significant (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (nltk) in Python for it. The problem I am having is that the synonym finder in nltk requires a part-of-speech argument in order to link a word to its synonyms. My attempted fix was to use the simplified part-of-speech tagger in nltk and then take the first letter of each tag, lowercased, to pass as that argument to the synonym finder, but this is not working.
def synonyms(Sentence):
    Keywords = []
    Equivalence = WordNetLemmatizer()
    Stemmer = stem.SnowballStemmer('english')
    for word in Sentence:
        word = Equivalence.lemmatize(word)
    words = nltk.word_tokenize(Sentence.lower())
    text = nltk.Text(words)
    tags = nltk.pos_tag(text)
    simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
    for tag in simplified_tags:
        print tag
        grammar_letter = tag[1][0].lower()
        if grammar_letter != 'd':
            Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
            print Call
            Word_Set = wordnet.synset(Call)
            paths = Word_Set.lemma_names
            for path in paths:
                Keywords.append(Stemmer.stem(path))
    return Keywords
This is the code I am currently working from. As you can see, I am first lemmatizing the input to reduce the number of matches I will have in the long run (I plan on running this on tens of thousands of sentences), and in theory I would stem each word after this to further that effect and reduce the number of redundant words I generate. However, this method almost invariably returns errors in the form of the one below:
Traceback (most recent call last):
  File "C:\Python27\test.py", line 45, in <module>
    synonyms('spray reddish attack force')
  File "C:\Python27\test.py", line 39, in synonyms
    Word_Set = wordnet.synset(Call)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
    raise WordNetError(message % (lemma, pos))
WordNetError: no lemma 'reddish' with part of speech 'n'
I don't have much control over the data this will be running over, and so simply cleaning my corpus is not really an option. Any ideas on how to solve this one?
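To be clear about the reduction step I mentioned above: lemmatizing maps inflected forms to a dictionary form, and stemming then collapses remaining variants. A minimal illustration, using the same WordNetLemmatizer and SnowballStemmer as in my code:

from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

# Lemmatize first, then stem what remains.
print stemmer.stem(lemmatizer.lemmatize('attacks'))  # 'attack'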
I did some more research and I have a promising lead, but I'm still not sure how I could implement it. In the case of a word that is not found, or is incorrectly assigned, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link it to the closest correctly categorized keyword, perhaps in conjunction with an edit distance measure, but again I haven't been able to find any kind of documentation on this.
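For anyone in the same spot, the metrics themselves are easy to call on a pair of synsets; what I haven't found documented is how to pick which pair. A minimal illustration of the two metrics named above:

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')

# Wu-Palmer: depth-based score in (0, 1]
print dog.wup_similarity(cat)
# Leacock-Chodorow: only defined for synsets with the same part of speech
print dog.lch_similarity(cat)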
Apparently nltk allows for the retrieval of all synsets associated with a word. Granted, there are usually a number of them, reflecting different word senses. In order to functionally find synonyms (or to decide whether two words are synonyms) you must match the closest possible pair of synonym sets, which is possible through any of the similarity metrics mentioned above. I wrote some basic code to do this, shown below, which checks whether two words are synonyms:
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import itertools

def Synonym_Checker(word1, word2):
    """Checks if word1 and word2 are synonyms. Returns True if they are, otherwise False"""
    equivalence = WordNetLemmatizer()
    word1 = equivalence.lemmatize(word1)
    word2 = equivalence.lemmatize(word2)

    word1_synonyms = wordnet.synsets(word1)
    word2_synonyms = wordnet.synsets(word2)

    # Score every sense pairing and keep the best one. itertools.product
    # lays the pairs out row by row, so divmod by the inner length
    # recovers the (word1 sense, word2 sense) indices of the top score.
    scores = [i.wup_similarity(j) for i, j in itertools.product(word1_synonyms, word2_synonyms)]
    max_index = scores.index(max(scores))
    best_match = divmod(max_index, len(word2_synonyms))

    word1_set = word1_synonyms[best_match[0]].lemma_names
    word2_set = word2_synonyms[best_match[1]].lemma_names
    match = any(word in word2_set for word in word1_set)
    return match

print Synonym_Checker("tomato", "Lycopersicon_esculentum")
I may try to implement progressively stronger stemming algorithms, but for the first few tests I ran, this code actually worked for every word I could find. If anyone has ideas on how to improve this algorithm, or anything that would improve this answer, I would love to hear it.
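One caveat I suspect is worth guarding against (an assumption on my part, not something I've hit yet): if either word is unknown to WordNet, wordnet.synsets returns an empty list, and max() over the empty scores list would raise a ValueError. A cheap pre-check avoids that:

from nltk.corpus import wordnet

def known_to_wordnet(word):
    # Unknown words (typos, rare names) yield no synsets at all.
    return len(wordnet.synsets(word)) > 0

print known_to_wordnet("tomato")      # True
print known_to_wordnet("tomatoooo")   # False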
Can you wrap your Word_Set = wordnet.synset(Call) with a try: and ignore the WordNetError exception? Looks like the error you have is that some words are not categorized correctly, but this exception would also occur for unrecognized words, so catching the exception just seems like a good idea to me.
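For what it's worth, a minimal sketch of that guard (WordNetError is importable from nltk.corpus.reader.wordnet, as the traceback shows):

from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

for Call in ['spray.n.01', 'reddish.n.01']:  # 'reddish' has no noun sense
    try:
        Word_Set = wordnet.synset(Call)
    except WordNetError:
        continue  # skip words WordNet cannot resolve under this tag
    print Word_Set.lemma_names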