I tried all the nltk stemming methods, but they give me weird results for some words.

Examples

It often cuts off the end of a word when it shouldn't:

- poodle => poodl
- article => articl

or it doesn't stem very well:

- easily and easy are not stemmed to the same word
- leaves, grows, fairly are not stemmed

Do you know of other stemming libraries in Python, or a good dictionary?

Thank you


What is the best stemming method in Python?

2 Answers

The results you are getting are (generally) expected for a stemmer in English. You say you tried "all the nltk methods" but when I try your examples, that doesn't seem to be the case.

Here are some examples using the PorterStemmer

    >>> import nltk
    >>> ps = nltk.stem.PorterStemmer()
    >>> ps.stem('grows')
    'grow'
    >>> ps.stem('leaves')
    'leav'
    >>> ps.stem('fairly')
    'fairli'

The results are 'grow', 'leav' and 'fairli', which, even if they are not what you wanted, are stemmed versions of the original words.

If we switch to the Snowball stemmer, we have to provide the language as a parameter.

    >>> import nltk
    >>> sno = nltk.stem.SnowballStemmer('english')
    >>> sno.stem('grows')
    'grow'
    >>> sno.stem('leaves')
    'leav'
    >>> sno.stem('fairly')
    'fair'

The results are as before for 'grows' and 'leaves', but 'fairly' is now stemmed to 'fair'.

So in both cases (and there are more than two stemmers available in nltk), words that you say are not stemmed, in fact, are. The LancasterStemmer will return 'easy' when provided with 'easily' or 'easy' as input.
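A quick way to check this on your own words is to run the three stemmers side by side. This is only a sketch, assuming nltk is installed; the `compare` helper is a name made up for this example and is not part of nltk.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# The three nltk stemmers mentioned in this answer.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Words taken from the question.
words = ["poodle", "article", "easily", "easy", "leaves", "grows", "fairly"]


def compare(words, stemmers):
    """Hypothetical helper: map each word to the stem each stemmer returns."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


for word, stems in compare(words, stemmers).items():
    print(word, stems)
```

Printing the stems next to each other makes it easy to see where the stemmers disagree, and to pick the one whose behaviour fits your data.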

Maybe you really wanted a lemmatizer? That would return 'article' and 'poodle' unchanged.

    >>> import nltk
    >>> # needs the WordNet data, e.g. via nltk.download('wordnet')
    >>> lemma = nltk.stem.WordNetLemmatizer()
    >>> lemma.lemmatize('article')
    'article'
    >>> lemma.lemmatize('leaves')
    'leaf'

answered Sep 28 '22 22:09

Spaceghost

All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results such as:

    In [3]: from nltk.stem.porter import PorterStemmer

    In [4]: stemmer = PorterStemmer()

    In [5]: stemmer.stem('identified')
    Out[5]: 'identifi'

    In [6]: stemmer.stem('nonsensical')
    Out[6]: 'nonsens'

To get the correct root words, one needs a dictionary-based stemmer such as the Hunspell stemmer. A Python implementation of it is available. Example code:

    >>> import hunspell
    >>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
    >>> hobj.spell('spookie')
    False
    >>> hobj.suggest('spookie')
    ['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
    >>> hobj.spell('spooky')
    True
    >>> hobj.analyze('linked')
    [' st:link fl:D']
    >>> hobj.stem('linked')
    ['link']
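To make the difference between the two approaches concrete, here is a toy sketch that uses only the standard library. The suffix list and the small word table are invented for this illustration; they are not the Porter rules or a Hunspell dictionary, and `rule_stem` and `dict_stem` are hypothetical names.

```python
# Toy illustration only: the rules and the word table below are made up.
# They are NOT the Porter algorithm or a Hunspell dictionary.

SUFFIXES = ("ied", "ical", "es", "ed", "s")  # hypothetical rule set


def rule_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical dictionary mapping inflected forms to their root word.
ROOTS = {
    "identified": "identify",
    "nonsensical": "nonsense",
    "leaves": "leaf",
    "linked": "link",
}


def dict_stem(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return ROOTS.get(word, word)


for w in ("identified", "nonsensical", "leaves", "linked"):
    print(w, rule_stem(w), dict_stem(w))
```

The rule-based function happily returns fragments that are not words, while the lookup can only ever return entries someone put in the table. That is the trade-off: a dictionary-based stemmer gives real root words but is only as good as its dictionary.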

answered Sep 28 '22 22:09

0xF


Tags: python, nltk, stemming

PeYoTlL
