 

WordNet lemmatizer in NLTK: what is the correct lemma for "boss"?

I use NLTK 3.0.4 and noticed that the lemmatizer returns different lemmas for the words boss and bosses.

from nltk.stem.wordnet import WordNetLemmatizer

wnl = WordNetLemmatizer()

# Lemmatize the singular form as a noun
print(wnl.lemmatize("boss", "n"))
# prints "bos"

# Lemmatize the plural form as a noun
print(wnl.lemmatize("bosses", "n"))
# prints "boss"

This seems like odd behavior to me, especially since boss is a known word in WordNet and there is a rule to keep ss.

Does anyone have an explanation, or is this just a bug? How should I deal with it?

gakhov asked Oct 30 '22 21:10



1 Answer

  1. I checked the code (_morphy()) that generates the possible analyses for a given word, and it contains no rule to keep ss.
  2. Bos (the cattle genus) is also a base form in WordNet.

Substitution rules:

MORPHOLOGICAL_SUBSTITUTIONS = {
    NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
           ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
           ('men', 'man'), ('ies', 'y')],
    VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
           ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
    ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
    ADV: []}
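
To see what these rules do to the two words from the question, here is a minimal sketch that applies the noun rules once to a word. The function name candidate_forms is made up for illustration; it only mimics the suffix substitution step and is not NLTK's own code.

```python
# Noun rules copied from the table above: (suffix to strip, replacement).
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
              ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
              ('men', 'man'), ('ies', 'y')]


def candidate_forms(word, rules=NOUN_RULES):
    """Hypothetical helper: apply every matching rule once to `word`."""
    return [word[:-len(old)] + new
            for old, new in rules
            if word.endswith(old)]


print(candidate_forms("boss"))    # ['bos']
print(candidate_forms("bosses"))  # ['bosse', 'boss']
```

Note that the plain ('s', '') rule matches boss just as it matches any other word ending in s, which is where bos comes from.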

Calling wnl.lemmatize("boss", "n"):

The rule ('s', '') strips the final s and produces bos. Since that is a valid base form in WordNet (Bos), it is returned. If Bos were not in WordNet, the lemma for boss would be boss, since no shorter form could be found.
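
As for how to deal with it, one possible workaround is to leave a word alone when it is already a known entry. The sketch below uses a toy lexicon (TOY_NOUNS) in place of WordNet and assumes the shortest known form wins, as described above. Both function names are hypothetical and this is not NLTK's implementation.

```python
# Toy stand-in for the WordNet noun index (an assumption for illustration).
TOY_NOUNS = {"boss", "bos"}

# Noun rules from MORPHOLOGICAL_SUBSTITUTIONS: (suffix to strip, replacement).
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
              ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
              ('men', 'man'), ('ies', 'y')]


def toy_lemmatize(word, lexicon=TOY_NOUNS):
    """Mimic the described behavior: the shortest known form wins."""
    candidates = [word[:-len(old)] + new
                  for old, new in NOUN_RULES
                  if word.endswith(old)]
    known = [form for form in [word] + candidates if form in lexicon]
    return min(known, key=len) if known else word


def guarded_lemmatize(word, lexicon=TOY_NOUNS):
    """Possible workaround: keep a word unchanged if it is already an entry."""
    if word in lexicon:
        return word
    return toy_lemmatize(word, lexicon)


print(toy_lemmatize("boss"))        # bos
print(toy_lemmatize("bosses"))      # boss
print(guarded_lemmatize("boss"))    # boss
print(guarded_lemmatize("bosses"))  # boss
```

With real WordNet data the same guard could be built on a lookup such as wordnet.synsets(word, pos), but be aware that it also leaves any plural that happens to be an entry of its own unchanged.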

b3000 answered Nov 15 '22 04:11
