
What is the best stemming method in Python?

I tried all the nltk stemming methods, but they give me weird results for some words.

Examples

They often cut off the end of a word when they shouldn't:

  • poodle => poodl
  • article => articl

or don't stem very well:

  • easily and easy are not stemmed to the same word
  • leaves, grows, fairly are not stemmed

Do you know of other stemming libraries in Python, or a good dictionary?

Thank you

PeYoTlL asked Jul 09 '14 07:07





2 Answers

The results you are getting are (generally) expected for a stemmer in English. You say you tried "all the nltk methods" but when I try your examples, that doesn't seem to be the case.

Here are some examples using the PorterStemmer:

    >>> import nltk
    >>> ps = nltk.stem.PorterStemmer()
    >>> ps.stem('grows')
    'grow'
    >>> ps.stem('leaves')
    'leav'
    >>> ps.stem('fairly')
    'fairli'

The results are 'grow', 'leav' and 'fairli' which, even if they are not what you wanted, are stemmed versions of the original words.

If we switch to the Snowball stemmer, we have to provide the language as a parameter.

    >>> import nltk
    >>> sno = nltk.stem.SnowballStemmer('english')
    >>> sno.stem('grows')
    'grow'
    >>> sno.stem('leaves')
    'leav'
    >>> sno.stem('fairly')
    'fair'

The results are the same as before for 'grows' and 'leaves', but 'fairly' is stemmed to 'fair'.

So in both cases (and there are more than two stemmers available in nltk), words that you say are not stemmed, in fact, are. The LancasterStemmer will return 'easy' when provided with 'easily' or 'easy' as input.
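To see the differences side by side, the comparison above can be wrapped in a small helper. This is only a sketch: `compare_stemmers`, `WORDS` and `STEMMERS` are names made up for this example, and it assumes nltk is installed. The three stemmer classes are the ones mentioned in this answer.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Words taken from the question and from the examples above.
WORDS = ["poodle", "article", "easily", "easy", "leaves", "grows", "fairly"]

# Hypothetical registry of stemmers to compare (names are our own labels).
STEMMERS = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare_stemmers(words, stemmers=STEMMERS):
    """Return {word: {stemmer_name: stem}} for every word and stemmer."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


if __name__ == "__main__":
    # Print one row per word so the stemmers can be compared at a glance.
    for word, stems in compare_stemmers(WORDS).items():
        print(f"{word:10}", stems)
```

Running it prints one line per word, which makes it easy to check whether any of the stemmers gives the output you expect before committing to one.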

Maybe you really wanted a lemmatizer? That would return 'article' and 'poodle' unchanged.

    >>> import nltk
    >>> # requires the WordNet corpus: nltk.download('wordnet')
    >>> lemma = nltk.stem.WordNetLemmatizer()
    >>> lemma.lemmatize('article')
    'article'
    >>> lemma.lemmatize('leaves')
    'leaf'
Spaceghost answered Sep 28 '22 22:09



All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results such as:

    In [3]: from nltk.stem.porter import *

    In [4]: stemmer = PorterStemmer()

    In [5]: stemmer.stem('identified')
    Out[5]: u'identifi'

    In [6]: stemmer.stem('nonsensical')
    Out[6]: u'nonsens'

To get the root words correctly, one needs a dictionary-based stemmer such as the Hunspell stemmer. Python bindings for it are available; here is some example code:

    >>> import hunspell
    >>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
    >>> hobj.spell('spookie')
    False
    >>> hobj.suggest('spookie')
    ['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
    >>> hobj.spell('spooky')
    True
    >>> hobj.analyze('linked')
    [' st:link fl:D']
    >>> hobj.stem('linked')
    ['link']
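If Hunspell is not available, the idea behind a dictionary-based stemmer can be illustrated in plain Python. This is a rough sketch and not how Hunspell works internally: `LEXICON`, `SUFFIXES`, `naive_strip` and `dictionary_stem` are hypothetical names, the lexicon entries are hand-written for illustration, and the fallback is a naive suffix stripper, not the Porter algorithm.

```python
# Hypothetical mini-lexicon mapping inflected forms to root words.
# A real dictionary-based stemmer derives this from dictionary files;
# these few entries are hand-written for illustration only.
LEXICON = {
    "identified": "identify",
    "leaves": "leaf",
    "linked": "link",
    "easily": "easy",
}

# Naive suffix rules used only as a fallback (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")


def naive_strip(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dictionary_stem(word, lexicon=LEXICON):
    """Look the word up first; fall back to suffix stripping if unknown."""
    word = word.lower()
    return lexicon.get(word, naive_strip(word))


if __name__ == "__main__":
    for w in ["identified", "leaves", "grows", "poodle"]:
        print(w, "->", dictionary_stem(w))
```

The lookup returns a real word whenever the lexicon knows the input, while the fallback shows the typical weakness of rule-based stripping: `naive_strip('identified')` yields the non-word `identifi`, just like the Porter example above.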
0xF answered Sep 28 '22 22:09
