I have a WordNet database set up, and I'm trying to generate synonyms for various words.
Take the word "greatest", for example. I'll look through and find several different synonyms, but none of them really fit the definition; one of them is "superlative".
I'm guessing that I need to do some sort of check by frequency in a given language, or stem the word to get the base word (for example, greatest -> great, great -> best).
What table should I be using to ensure my words make some modicum of sense?
With default settings (the lemmatizer treats the word as a noun unless told otherwise), neither the stemmer nor the lemmatizer can get you from greatest -> great:
>>> from nltk.stem import WordNetLemmatizer, PorterStemmer
>>> porter = PorterStemmer()
>>> wnl = WordNetLemmatizer()
>>> greatest = 'greatest'
>>> porter.stem(greatest)
u'greatest'
>>> wnl.lemmatize(greatest)
'greatest'
>>> greater = 'greater'
>>> wnl.lemmatize(greater)
'greater'
>>> porter.stem(greater)
u'greater'
But it seems you can make use of some nice properties of the Penn Treebank tagset, which tags superlatives as JJS, comparatives as JJR, and base adjectives as JJ, to get from greatest -> great:
>>> from nltk import pos_tag
>>> pos_tag(['greatest'])
[('greatest', 'JJS')]
>>> pos_tag(['greater'])
[('greater', 'JJR')]
>>> pos_tag(['great'])
[('great', 'JJ')]
Let's try a crazy rule-based system, starting from greatest:
>>> import re
>>> word1 = 'greatest'
>>> re.sub('est$', '', word1)
'great'
>>> re.sub('est$', 'er', word1)
'greater'
>>> pos_tag([re.sub('est$', '', word1)])[0][1]
'JJ'
>>> pos_tag([re.sub('est$', 'er', word1)])[0][1]
'JJR'
>>> word1
'greatest'
Now that we know we can build our own little superlative stemmer/lemmatizer/tail_substituter, let's write a rule: if a word gets a superlative POS tag (JJS), and our tail_substituter yields a JJ word when we strip the "est" and a JJR word when we swap it for "er", then we can safely say that the comparative and base forms of the word are easy to get with our tail_substituter:
>>> if pos_tag([word1])[0][1] == 'JJS' \
... and pos_tag([re.sub('est$', '', word1)])[0][1] == 'JJ' \
... and pos_tag([re.sub('est$', 'er', word1)])[0][1] == 'JJR':
...     comparative = re.sub('est$', 'er', word1)
...     adjective = re.sub('est$', '', word1)
...
>>> adjective
'great'
>>> comparative
'greater'
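The rule above can be wrapped in a small function. This is only a minimal sketch: the names `tail_substitute`, `tag_of`, `TOY_TAGS`, and `toy_tag_of` are hypothetical, and the toy tagger is a hand-written stand-in so the example runs without NLTK data. With NLTK installed, you would presumably pass something like `lambda w: pos_tag([w])[0][1]` as `tag_of` instead.

```python
import re

def tail_substitute(word, tag_of):
    """Return (adjective, comparative) for a regular superlative, else None.

    tag_of is any callable mapping a word to a Penn Treebank tag
    (hypothetical parameter; swap in a real tagger as needed).
    """
    if not word.endswith('est'):
        return None
    adjective = re.sub('est$', '', word)      # greatest -> great
    comparative = re.sub('est$', 'er', word)  # greatest -> greater
    # Accept the substitution only if all three tags line up.
    if (tag_of(word) == 'JJS'
            and tag_of(adjective) == 'JJ'
            and tag_of(comparative) == 'JJR'):
        return adjective, comparative
    return None

# Toy stand-in for a POS tagger (assumption: illustrative tags only).
TOY_TAGS = {'greatest': 'JJS', 'greater': 'JJR', 'great': 'JJ', 'best': 'JJS'}

def toy_tag_of(word):
    # Unknown words fall back to a noun tag.
    return TOY_TAGS.get(word, 'NN')

result = tail_substitute('greatest', toy_tag_of)
irregular = tail_substitute('best', toy_tag_of)
```

Because the stripped and swapped forms are checked against the tagger, an irregular superlative such as "best" is rejected rather than turned into a nonsense stem.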
That gets you from greatest -> greater -> great. Going from great -> best is sort of weird, since lexically they're not related, although their semantics seem related.
So I think it would be subjective to say that great -> best is a valid transformation.