
General synonym and part of speech processing using nltk

I'm trying to create a general synonym identifier for the significant words in a sentence (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (NLTK) in Python for it. The problem I am having is that the synonym finder in NLTK requires a part-of-speech argument in order to be linked to its synonyms. My attempted fix was to use the simplified part-of-speech tagger present in NLTK and then take the first letter of each tag, lower-cased, as the part-of-speech argument to the synonym finder; however, this is not working.

import nltk
from nltk import stem
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag.simplify import simplify_wsj_tag  # NLTK 2.x

def synonyms(Sentence):
    Keywords = []
    Equivalence = WordNetLemmatizer()
    Stemmer = stem.SnowballStemmer('english')
    # Tokenize before lemmatizing; iterating over the raw string would
    # walk it character by character
    words = [Equivalence.lemmatize(word) for word in nltk.word_tokenize(Sentence.lower())]
    text = nltk.Text(words)
    tags = nltk.pos_tag(text)
    simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
    for tag in simplified_tags:
        print tag
        grammar_letter = tag[1][0].lower()
        if grammar_letter != 'd':  # skip determiners like "a" and "the"
            Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
            print Call
            Word_Set = wordnet.synset(Call)  # raises WordNetError (see below)
            paths = Word_Set.lemma_names
            for path in paths:
                Keywords.append(Stemmer.stem(path))
    return Keywords

This is the code I am currently working from. As you can see, I first lemmatize the input to reduce the number of matches I will have in the long run (I plan on running this on tens of thousands of sentences), and in theory I would stem each word after this to further that effect and reduce the number of redundant words I generate. However, this method almost invariably returns errors in the form of the one below:

Traceback (most recent call last):
  File "C:\Python27\test.py", line 45, in <module>
    synonyms('spray reddish attack force')
  File "C:\Python27\test.py", line 39, in synonyms
    Word_Set = wordnet.synset(Call)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
    raise WordNetError(message % (lemma, pos))
WordNetError: no lemma 'reddish' with part of speech 'n'

I don't have much control over the data this will be running over, and so simply cleaning my corpus is not really an option. Any ideas on how to solve this one?

I did some more research and I have a promising lead, but I'm still not sure how I could implement it. When a word is not found, or is assigned the wrong category, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link it to the closest correctly categorized keyword, perhaps in conjunction with an edit-distance measure, but again I haven't been able to find any kind of documentation on this.
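For reference, these metrics are exposed as methods on NLTK's Synset objects. A minimal sketch of what I mean, using two placeholder synsets:

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')

# Each metric returns a score; higher means more similar
print dog.path_similarity(cat)  # shortest-path similarity
print dog.lch_similarity(cat)   # Leacock-Chodorow (requires same part of speech)
print dog.wup_similarity(cat)   # Wu-Palmer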

asked Jun 12 '12 by Slater Victoroff

2 Answers

Apparently NLTK allows for the retrieval of all synsets associated with a word. Granted, there are usually a number of them, reflecting different word senses. In order to functionally find synonyms (or to decide whether two words are synonyms) you must attempt to match the closest possible synonym sets, which is possible through any of the similarity metrics mentioned above. I wrote some basic code to do this, shown below: how to find whether two words are synonyms.

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import itertools


def Synonym_Checker(word1, word2):
    """Checks if word1 and word2 are synonyms. Returns True if they are, otherwise False"""
    equivalence = WordNetLemmatizer()
    word1 = equivalence.lemmatize(word1)
    word2 = equivalence.lemmatize(word2)

    word1_synonyms = wordnet.synsets(word1)
    word2_synonyms = wordnet.synsets(word2)

    # Score every pairing of senses; wup_similarity returns None for
    # incomparable pairs, so substitute 0 in that case
    scores = [i.wup_similarity(j) or 0 for i, j in itertools.product(word1_synonyms, word2_synonyms)]
    if not scores:
        return False  # one of the words has no WordNet entry
    max_index = scores.index(max(scores))
    # itertools.product iterates in row-major order over
    # word1_synonyms x word2_synonyms, so unflatten the winning index
    best_match = (max_index // len(word2_synonyms), max_index % len(word2_synonyms))

    word1_set = word1_synonyms[best_match[0]].lemma_names
    word2_set = word2_synonyms[best_match[1]].lemma_names
    match = any(word in word2_set for word in word1_set)

    return match

print Synonym_Checker("tomato", "Lycopersicon_esculentum")

I may try to implement progressively stronger stemming algorithms, but for the first few tests I did, this code actually worked for every word I could find. If anyone has ideas on how to improve this algorithm, or anything that would improve this answer in any way, I would love to hear it.
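For anyone who wants to experiment with that, NLTK ships several stemmers of roughly increasing aggressiveness; a quick sketch for comparing them (the sample word is arbitrary):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Roughly ordered from gentlest to most aggressive; a stronger stemmer
# conflates more word forms but also produces more false merges
for stemmer in [PorterStemmer(), SnowballStemmer('english'), LancasterStemmer()]:
    print stemmer.stem('generously')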

answered by Slater Victoroff

Can you wrap your Word_Set = wordnet.synset(Call) in a try: and ignore the WordNetError exception? It looks like the error you have is that some words are not categorized correctly, but this exception would also occur for unrecognized words, so catching the exception just seems like a good idea to me.
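A minimal sketch of what I mean (WordNetError is importable from nltk.corpus.reader.wordnet; the continue assumes this sits inside your tag loop):

from nltk.corpus.reader.wordnet import WordNetError

try:
    Word_Set = wordnet.synset(Call)
except WordNetError:
    continue  # word missing or tagged with the wrong part of speech; skip it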

answered by ChipJust