Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NLTK words lemmatizing

I am trying to do lemmatization on words with NLTK.

What I can find now is that I can use the stem package to get some results like transform "cars" to "car" and "women" to "woman", however I cannot do lemmatization on some words with affixes like "acknowledgement".

When using WordNetLemmatizer() on "acknowledgement", it returns "acknowledgement" and using .PorterStemmer(), it returns "acknowledg" rather than "acknowledge".

Can anyone tell me how to eliminate the affixes of words?
Say, when input is "acknowledgement", the output to be "acknowledge"

like image 314
noben Avatar asked Feb 16 '23 22:02

noben


1 Answers

Lemmatization does not (and should not) return "acknowledge" for "acknowledgement". The former is a verb, while the latter is a noun. Porter's stemming algorithm, on the other hand, simply uses a fixed set of rules. So, your only way there is to change the rules at source. (NOT the right way to fix your problem).

What you are looking for is the derivationally related form of "acknowledgement", and for this, your best source is WordNet. You can check this online on WordNet.

There are quite a few WordNet-based libraries that you can use for this (e.g. in JWNL in Java). In Python, NLTK should be able to get the derivationally related form you saw online:

from nltk.corpus import wordnet as wn

acknowledgment_synset = wn.synset('acknowledgement.n.01')
acknowledgment_lemma = acknowledgment_synset.lemmas[1]

print(acknowledgment_lemma.derivationally_related_forms())
# [Lemma('admit.v.01.acknowledge'), Lemma('acknowledge.v.06.acknowledge')]
like image 150
Chthonic Project Avatar answered Feb 18 '23 13:02

Chthonic Project