I'm using an NLTK stemmer to strip grammatical variations from a word. However, the Porter and Snowball stemmers also remove the trailing "e" from the base form of a noun or verb, e.g., "profile" becomes "profil".
How can I prevent this from happening? I know I could guard against it with a conditional, but that would obviously fail on other cases.
Is there an option or another API for what I want?
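For reference, the behaviour described above can be reproduced with a minimal sketch like the one below. It assumes NLTK is installed; the word list is only an illustration, and neither stemmer needs any extra corpus download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Both stemmers are rule-based and work without downloading NLTK data.
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Illustrative inputs (an assumption, not taken from a real dataset).
words = ["profile", "profiles", "profiling"]

porter_stems = [porter.stem(w) for w in words]
snowball_stems = [snowball.stem(w) for w in words]

# Each word is reduced to the same truncated stem, which is not a dictionary word.
print(porter_stems)
print(snowball_stems)
```

Both stemmers treat the final "e" as a removable suffix, so the stem is shared by all three forms but is not itself a valid English word.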
Porter's Stemmer is one of the most widely used stemming techniques in Natural Language Processing, but as it has been almost 30 years since its first implementation and development, Martin Porter developed an updated version called Porter2, which is also commonly called the Snowball Stemmer due to its nltk ...
Stemming with Python nltk package. "Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."
nltk.stem.lancaster module. A word stemmer based on the Lancaster (Paice/Husk) stemming algorithm.
Stemming is the process of reducing a word to its stem, the part to which suffixes and prefixes attach, or to the root form of the word, known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
I agree with Philip that the goal of a stemmer is to retain only the stem. For this particular case you can try a lemmatizer instead of a stemmer: it tends to retain more of the word and is meant to normalize exactly these different forms of a word, like 'profiles' --> 'profile'. There is a class in NLTK for this: try WordNetLemmatizer() from nltk.stem.
Beware that it's still not perfect (nothing is when working with text): I used to get 'physic' from 'physics'.