
Which word stemmer should I use in nltk?

My goal is to analyze a corpus (Twitter, for now) for emotional content. Just today I realized it would make more sense to match on word stems rather than maintain an exhaustive list of emotional words. So I've been exploring nltk.stem, only to find that it offers five different stemmers. I'd like to ask the Stack Overflow linguists whether LancasterStemmer, PorterStemmer, RegexpStemmer, RSLPStemmer, or WordNetStemmer is best, preferably with some justification.

speciousfool asked Aug 12 '09 08:08


People also ask

Which is the best Stemmer in NLTK?

In NLTK's SnowballStemmer, the 'english' stemmer is better than the original 'porter' stemmer. Extra stemmer tests can be found in nltk.test.unit.

Which stemming algorithm is best?

Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is widely regarded as an improvement over the original Porter stemmer, a view shared by the Porter stemmer's own author.

Should I stem or Lemmatize?

Stemming is faster than lemmatization because it chops off word endings regardless of context, whereas lemmatization is context-dependent. Stemming is a rule-based approach, whereas lemmatization is a dictionary-based approach that maps each word to its canonical form. Lemmatization is generally more accurate than stemming.
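The difference can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the suffix list, the tiny lemma dictionary, and the names `toy_stem` and `toy_lemmatize` are hypothetical and are not taken from NLTK or any real stemmer.

```python
# Hypothetical suffix list for the toy stemmer (not a real stemming algorithm).
TOY_SUFFIXES = ("ing", "ed", "ly", "es", "s")

# Hypothetical lookup table standing in for a lemmatizer's dictionary.
TOY_LEMMAS = {"better": "good", "feet": "foot", "was": "be", "running": "run"}


def toy_stem(word):
    """Rule-based: strip the first matching suffix, ignoring meaning and context."""
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary-based: look the word up and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ["running", "better", "feet"]:
        print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based function is cheap but can produce non-words (here "running" becomes "runn"), while the lookup handles irregular forms such as "better" only because someone put them in the dictionary.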

Which NLTK package can be used for stemming?

nltk.stem is a package that performs stemming using different classes.
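One way to choose between the classes is to run a few of them over the same words and compare the stems. The sketch below assumes NLTK is installed; the helper name `compare_stemmers` and the sample words are made up for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def compare_stemmers(words):
    """Return a mapping of stemmer name to the stems it produces for `words`."""
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),
    }
    return {
        name: [stemmer.stem(word) for word in words]
        for name, stemmer in stemmers.items()
    }


if __name__ == "__main__":
    # Sample words chosen arbitrarily; substitute words from your own corpus.
    for name, stems in compare_stemmers(["running", "happiness", "angrily"]).items():
        print(name, stems)
```

None of these three stemmers needs extra corpus downloads, so the comparison is quick to repeat on a sample of real tweets.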


1 Answer

It may be a bit different from what you are asking, but the Nodebox Linguistics library contains an is_emotive() function, which seems to check whether a word is a recursive hyponym of certain emotional words. From commonsense.py:

    ekman = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
    other = ["emotion", "feeling", "expression"]

Not a stemmer, but an interesting approach to check out.
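To show what a recursive hyponym check could look like, here is a minimal sketch. It is not Nodebox's implementation: the `TOY_HYPERNYMS` table and the function `is_emotive_sketch` are hypothetical, and a real version would walk a lexical database such as WordNet instead of a hand-written dictionary.

```python
# Root words taken from the two lists quoted above.
EMOTION_ROOTS = {
    "anger", "disgust", "fear", "joy", "sadness", "surprise",
    "emotion", "feeling", "expression",
}

# Hypothetical word -> hypernym (more general term) table, for illustration only.
TOY_HYPERNYMS = {
    "fury": "rage",
    "rage": "anger",
    "terror": "fear",
    "chair": "furniture",
}


def is_emotive_sketch(word, hypernyms=TOY_HYPERNYMS, roots=EMOTION_ROOTS):
    """Follow hypernym links upward and report whether a root word is reached."""
    seen = set()
    while word is not None and word not in seen:
        if word in roots:
            return True
        seen.add(word)  # guard against cycles in the table
        word = hypernyms.get(word)
    return False


if __name__ == "__main__":
    for w in ["fury", "terror", "chair"]:
        print(w, is_emotive_sketch(w))
```

With this toy table, "fury" reaches "anger" via "rage" and counts as emotive, while "chair" never reaches a root word.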

tomcat23 answered Sep 20 '22 06:09