I'm trying to create a general synonym identifier for the words in a sentence that are significant (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (nltk) in Python for it. The problem I am having is that the synonym finder in nltk requires a part-of-speech argument in order to link a word to its synonyms. My attempted fix was to use the simplified part-of-speech tagger in nltk and then take the first letter of each tag, lowercased, to pass as that argument to the synonym finder, but this is not working.
def synonyms(Sentence):
    Keywords = []
    Equivalence = WordNetLemmatizer()
    Stemmer = stem.SnowballStemmer('english')
    for word in Sentence:
        word = Equivalence.lemmatize(word)
    words = nltk.word_tokenize(Sentence.lower())
    text = nltk.Text(words)
    tags = nltk.pos_tag(text)
    simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
    for tag in simplified_tags:
        print tag
        grammar_letter = tag[1][0].lower()
        if grammar_letter != 'd':
            Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
            print Call
            Word_Set = wordnet.synset(Call)
            paths = Word_Set.lemma_names
            for path in paths:
                Keywords.append(Stemmer.stem(path))
    return Keywords
This is the code I am currently working from. As you can see, I am first lemmatizing the input to reduce the number of matches I will have in the long run (I plan on running this on tens of thousands of sentences), and in theory I would stem each word after this to further that effect and reduce the number of redundant words I generate. However, this method almost invariably returns errors in the form of the one below:
Traceback (most recent call last):
  File "C:\Python27\test.py", line 45, in <module>
    synonyms('spray reddish attack force')
  File "C:\Python27\test.py", line 39, in synonyms
    Word_Set = wordnet.synset(Call)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
    raise WordNetError(message % (lemma, pos))
WordNetError: no lemma 'reddish' with part of speech 'n'
I don't have much control over the data this will be running over, and so simply cleaning my corpus is not really an option. Any ideas on how to solve this one?
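To be clear about the reduction step I mentioned above: lemmatizing maps inflected forms to a dictionary form, and stemming then collapses remaining variants. A minimal illustration, using the same WordNetLemmatizer and SnowballStemmer as in my code:

from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

# Lemmatize first, then stem what remains.
print stemmer.stem(lemmatizer.lemmatize('attacks'))  # 'attack'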
I did some more research and I have a promising lead, but I'm still not sure how I could implement it. In the case of a word that is not found, or is incorrectly assigned, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link it to the closest correctly categorized keyword, perhaps in conjunction with an edit distance measure, but again I haven't been able to find any kind of documentation on this.
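For anyone in the same spot, the metrics themselves are easy to call on a pair of synsets; what I haven't found documented is how to pick which pair. A minimal illustration of the two metrics named above:

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')

# Wu-Palmer: depth-based score in (0, 1]
print dog.wup_similarity(cat)
# Leacock-Chodorow: only defined for synsets with the same part of speech
print dog.lch_similarity(cat)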
Apparently nltk allows for the retrieval of all synsets associated with a word. Granted, there are usually a number of them, reflecting different word senses. In order to functionally find synonyms (or to decide whether two words are synonyms) you must match the closest possible pair of synonym sets, which is possible through any of the similarity metrics mentioned above. I wrote some basic code to do this, shown below, which checks whether two words are synonyms:
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import itertools

def Synonym_Checker(word1, word2):
    """Checks if word1 and word2 are synonyms. Returns True if they are, otherwise False"""
    equivalence = WordNetLemmatizer()
    word1 = equivalence.lemmatize(word1)
    word2 = equivalence.lemmatize(word2)

    word1_synonyms = wordnet.synsets(word1)
    word2_synonyms = wordnet.synsets(word2)

    # Score every sense pairing and keep the best one. itertools.product
    # lays the pairs out row by row, so divmod by the inner length
    # recovers the (word1 sense, word2 sense) indices of the top score.
    scores = [i.wup_similarity(j) for i, j in itertools.product(word1_synonyms, word2_synonyms)]
    max_index = scores.index(max(scores))
    best_match = divmod(max_index, len(word2_synonyms))

    word1_set = word1_synonyms[best_match[0]].lemma_names
    word2_set = word2_synonyms[best_match[1]].lemma_names
    match = any(word in word2_set for word in word1_set)
    return match

print Synonym_Checker("tomato", "Lycopersicon_esculentum")
I may try to implement progressively stronger stemming algorithms, but for the first few tests I ran, this code actually worked for every word I could find. If anyone has ideas on how to improve this algorithm, or anything that would improve this answer, I would love to hear it.
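One caveat I suspect is worth guarding against (an assumption on my part, not something I've hit yet): if either word is unknown to WordNet, wordnet.synsets returns an empty list, and max() over the empty scores list would raise a ValueError. A cheap pre-check avoids that:

from nltk.corpus import wordnet

def known_to_wordnet(word):
    # Unknown words (typos, rare names) yield no synsets at all.
    return len(wordnet.synsets(word)) > 0

print known_to_wordnet("tomato")      # True
print known_to_wordnet("tomatoooo")   # False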
Can you wrap your Word_Set = wordnet.synset(Call) with a try: and ignore the WordNetError exception? Looks like the error you have is that some words are not categorized correctly, but this exception would also occur for unrecognized words, so catching the exception just seems like a good idea to me.
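For what it's worth, a minimal sketch of that guard (WordNetError is importable from nltk.corpus.reader.wordnet, as the traceback shows):

from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

for Call in ['spray.n.01', 'reddish.n.01']:  # 'reddish' has no noun sense
    try:
        Word_Set = wordnet.synset(Call)
    except WordNetError:
        continue  # skip words WordNet cannot resolve under this tag
    print Word_Set.lemma_names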