I use NLTK 3.0.4 and noticed that the lemmas for the words boss and bosses are different.
from nltk.stem.wordnet import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("boss", "n"))
# prints "bos"
print(wnl.lemmatize("bosses", "n"))
# prints "boss"
From my point of view this is weird behavior, especially since boss is a known word in WordNet and there is a rule to keep ss.
Does anyone have an explanation, or is this just a bug? How should I deal with it?
In the function (_morphy()) that generates the possible analyses for a given word, I found that there is no rule included to keep ss. Bos is also a base form in WordNet.
Substitution rules:
MORPHOLOGICAL_SUBSTITUTIONS = {
    NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
           ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
           ('men', 'man'), ('ies', 'y')],
    VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
           ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
    ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
    ADV: []}
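To see what these rules produce for the two words in question, here is a minimal sketch that applies only the noun rules from the table above. It does not use NLTK; apply_rules is a hypothetical helper written for illustration, not NLTK's own function.

```python
# Noun rules copied from the table above: (suffix to strip, replacement).
NOUN_RULES = [("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"),
              ("zes", "z"), ("ches", "ch"), ("shes", "sh"),
              ("men", "man"), ("ies", "y")]


def apply_rules(word, rules=NOUN_RULES):
    """Return every candidate produced by a rule whose suffix matches.

    Hypothetical helper for illustration; not part of NLTK's API.
    """
    return [word[:-len(old)] + new for old, new in rules if word.endswith(old)]


print(apply_rules("boss"))    # ['bos']
print(apply_rules("bosses"))  # ['bosse', 'boss']
```

Note that nothing in the rule list protects a trailing ss: the plain ('s', '') rule fires on boss just as it would on any plural.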
Calling wnl.lemmatize("boss", "n"):
Since a suitable base form (Bos) can be found when applying the substitution rules, it is returned. If this had not been included in WordNet, the lemma for boss would be boss, since no shorter form could be found.
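The behavior described above can be reproduced with a toy model. This is only a sketch under stated assumptions: KNOWN_NOUNS is a tiny hypothetical stand-in for WordNet's noun lemmas, toy_lemmatize simply picks the shortest known candidate, and keep_known is one possible workaround (return the word unchanged if it is already a known lemma). None of these names exist in NLTK.

```python
# Noun rules from the substitution table: (suffix to strip, replacement).
NOUN_RULES = [("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"),
              ("zes", "z"), ("ches", "ch"), ("shes", "sh"),
              ("men", "man"), ("ies", "y")]

# Assumption: a tiny stand-in for the noun lemmas WordNet knows about.
KNOWN_NOUNS = {"boss", "bos"}


def candidates(word):
    """The word itself plus every form produced by a matching rule."""
    return [word] + [word[:-len(old)] + new
                     for old, new in NOUN_RULES if word.endswith(old)]


def toy_lemmatize(word, known=KNOWN_NOUNS):
    """Pick the shortest known candidate, or the word itself if none is known."""
    found = [c for c in candidates(word) if c in known]
    return min(found, key=len) if found else word


def keep_known(word, known=KNOWN_NOUNS):
    """Possible workaround: leave words that are already known lemmas alone."""
    return word if word in known else toy_lemmatize(word, known)


print(toy_lemmatize("boss"))                  # bos
print(toy_lemmatize("bosses"))                # boss
print(toy_lemmatize("boss", known={"boss"}))  # boss (no shorter known form)
print(keep_known("boss"))                     # boss
```

With real NLTK, the same guard would need a check against WordNet itself for whether the input is already a noun lemma; how best to do that depends on the NLTK version in use.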