When I open verb.exc, I can see:
saw see
But when I use lemmatization in code:
>>> print lmtzr.lemmatize('saw', 'v')
saw
How can this happen? Am I misunderstanding how WordNet's exception list works?
In short:
It's a somewhat strange case of an exception entry.
There's also the sentence I saw the log into half,
where "saw" is a present-tense verb.
See @nschneid's solution of using more fine-grained tags in the issue raised: https://github.com/nltk/nltk/issues/1196
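For instance, one common way to use fine-grained tags is to map Penn Treebank POS tags down to WordNet's single-letter POS constants before lemmatizing, so a past-tense verb tag like VBD is lemmatized as a verb. This is only a sketch; the helper name penn_to_wn is my own, not an NLTK API:

```python
def penn_to_wn(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    WordNet's POS constants are just the letters 'n', 'v', 'a', 'r'.
    """
    if tag.startswith('V'):
        return 'v'   # verbs: VB, VBD ('saw'), VBG, VBN, VBP, VBZ
    if tag.startswith('J'):
        return 'a'   # adjectives
    if tag.startswith('R'):
        return 'r'   # adverbs
    return 'n'       # default to noun, matching the lemmatizer's own default

# Usage with NLTK (requires the WordNet corpus data to be downloaded):
# from nltk.stem import WordNetLemmatizer
# wnl = WordNetLemmatizer()
# wnl.lemmatize('running', penn_to_wn('VBG'))
```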
In long:
If we take a look at how we call the WordNet lemmatizer in NLTK:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('saw', pos='v')
'saw'
>>> wnl.lemmatize('saw')
'saw'
Specifying the POS tag seems redundant. Let's take a look at the lemmatizer code itself:
class WordNetLemmatizer(object):
    def __init__(self):
        pass

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word
What it does is rely on the _morphy()
function of the wordnet corpus reader to return possible lemmas.
If we trace through the nltk.corpus.wordnet
code, we find the _morphy()
code at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1679
The first few lines of the function read the exceptions from WordNet's verb.exc
file, i.e. https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1687
So if we do an ad-hoc lookup in the exception map outside of the lemmatizer function, we do see that 'saw' -> 'see':
>>> from nltk.corpus import wordnet as wn
>>> exceptions = wn._exception_map['v']
>>> exceptions['saw']
[u'see']
So if we call the _morphy()
function outside of the lemmatizer:
>>> from nltk.corpus import wordnet as wn
>>> exceptions = wn._exception_map['v']
>>> wn._morphy('saw', 'v')
['saw', u'see']
Going back to the return line of the WordNetLemmatizer.lemmatize()
code, we see return min(lemmas, key=len) if lemmas else word
:
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
So the function returns the output of wn._morphy()
with the minimum length. But in this case saw and see have the same length, so min() returns the first of the tied items in the list returned by wn._morphy()
, i.e. saw
.
Effectively, WordNetLemmatizer.lemmatize()
is doing this:
>>> from nltk.corpus import wordnet as wn
>>> wn._morphy('saw', 'v')
['saw', u'see']
>>> min(wn._morphy('saw', 'v'), key=len)
'saw'
Note, though, that this isn't exactly a "bug" but a "feature": keeping saw in the list represents another possible lemma of the surface word (although that reading is rare in context, e.g. I saw the log into half)
.
How can I avoid this "bug" in NLTK?
To avoid this "bug" in NLTK, use nltk.corpus.wordnet._morphy()
instead of nltk.stem.WordNetLemmatizer.lemmatize()
; that way you always get the full list of possible lemmas instead of a single lemma filtered by length. To lemmatize:
>>> from nltk.corpus import wordnet as wn
>>> exceptions = wn._exception_map['v']
>>> wn._morphy('saw', pos='v')
['saw', 'see']
More choice is better than a wrong choice.
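Given the full candidate list, you still need a policy for choosing among the lemmas. As one hedged sketch (the helper pick_lemma and the sense-count heuristic are my own, not part of NLTK), you could prefer the candidate with more WordNet senses for the given POS, which in this case would favour see over saw:

```python
def pick_lemma(word, candidates, sense_count):
    """Pick the candidate lemma with the most senses according to
    sense_count (a function from lemma -> number of senses);
    fall back to the surface word if there are no candidates."""
    if not candidates:
        return word
    return max(candidates, key=sense_count)

# With NLTK (requires the WordNet corpus data), the sense count can
# come from wn.synsets(); the verb 'see' has far more senses than
# the verb 'saw', so 'see' wins:
# from nltk.corpus import wordnet as wn
# pick_lemma('saw', wn._morphy('saw', 'v'),
#            lambda l: len(wn.synsets(l, 'v')))
```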
How to fix this "bug" in NLTK?
Other than min(lemmas, key=len)
being sub-optimal, the _morphy()
function is a little inconsistent in how it handles exceptions, because a plural form can be a lemma in its own right with a rare meaning, e.g. using teeth
to refer to dentures, see http://wordnetweb.princeton.edu/perl/webwn?s=teeth
>>> wn._morphy('teeth', 'n')
['teeth', u'tooth']
>>> wn._morphy('goose', 'n')
['goose']
>>> wn._morphy('geese', 'n')
[u'goose']
So the questionable lemma choice must be introduced in the nltk.corpus.wordnet._morphy()
code after the exception list is consulted. One quick hack is to return the exception entry immediately whenever the input surface word occurs in the exception list, e.g.:
from nltk.corpus import wordnet as wn

def _morphy(word, pos):
    exceptions = wn._exception_map[pos]
    if word in exceptions:
        return exceptions[word]
    # Else, continue with the rest of the _morphy() code.