
How to stop NLTK stemmer from removing the trailing "e"?

Tags: python, nlp, nltk

I'm using an NLTK stemmer to strip grammatical variations from words and reduce them to a stem. However, the Porter and Snowball stemmers also remove the trailing "e" from the base form of a noun or verb, e.g., Profile becomes Profil.

How can I prevent this from happening? I know I can use a conditional to guard against this, but it will obviously fail in other cases.

Is there an option or another API for what I want?

kakyo asked Jul 01 '14 19:07





1 Answer

I agree with Philip that the goal of a stemmer is to retain only the stem, which need not be a valid word. For this particular case you can try a lemmatizer instead of a stemmer. A lemmatizer should retain more of the word and is meant precisely to collapse different forms of a word, like 'profiles' --> 'profile'. NLTK has a class for this: try WordNetLemmatizer() from nltk.stem.

Beware that it's still not perfect (nothing is when working with text): for example, I used to get 'physic' from 'physics'.
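As a rough sketch of the lemmatizer suggestion, something like the following might work. It assumes nltk is installed and the WordNet corpus has been downloaded once with nltk.download("wordnet"). The helper names make_wordnet_lemmatizer, normalize and demo are made up for illustration and are not part of NLTK.

```python
from typing import Callable, Iterable, List


def make_wordnet_lemmatizer() -> Callable[[str], str]:
    """Return a function that lemmatizes one word, treating it as a noun.

    NLTK is imported lazily here; the WordNet corpus must already be
    installed (assumption: nltk.download("wordnet") was run beforehand).
    """
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # pos="n" is NLTK's default; pass "v" instead to lemmatize verbs.
    return lambda word: lemmatizer.lemmatize(word, pos="n")


def normalize(words: Iterable[str], lemmatize: Callable[[str], str]) -> List[str]:
    """Lowercase each word, then map it to its lemma (hypothetical helper)."""
    return [lemmatize(word.lower()) for word in words]


def demo() -> None:
    """Print lemmas for a few sample words; needs nltk and the WordNet data."""
    lemmatize = make_wordnet_lemmatizer()
    print(normalize(["Profile", "profiles", "physics"], lemmatize))
```

Calling demo() should keep the trailing "e" of "profile", since the lemmatizer looks words up in WordNet rather than stripping suffixes by rule. Note that lemmatize() treats words as nouns unless a part of speech is passed, so verb forms need pos="v".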

Everst answered Nov 14 '22 22:11
