Stemming unstructured text in NLTK

Tags: tokenize, nltk, lemmatization, text-analysis

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:

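import nltk
from nltk import stem
from nltk.corpus import stopwords

# Read the raw text (the 'rU' mode is rejected by Python 3.11+; plain 'r' is enough)
with open('tupac_original.txt', 'r') as f:
    text = f.read()

tup = nltk.Text(text.split())
# Lowercase everything and keep purely alphabetic tokens
lowtup = [w.lower() for w in tup if w.isalpha()]
# Build the stopword set once instead of reloading the list for every token
stop = set(stopwords.words('english'))
tupclean = [w for w in lowtup if w not in stop]
# Strip a trailing 'az', 'as' or 'a' from every remaining token
tupstem = stem.RegexpStemmer('az$|as$|a$')
[tupstem.stem(i) for i in tupclean]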

The result of the above is:

['like', 'ed', 'young', 'black', 'like'...]

I'm trying to clean up .txt files (lowercase everything, remove stopwords, etc.), normalize multiple spellings of a word into one, and then build a frequency distribution. I know how to use FreqDist, but any suggestions as to where I'm going wrong with the stemming?


asked Sep 26 '13 18:09

user2221429

1 Answer

NLTK ships with several well-known stemmers (see http://nltk.org/api/nltk.stem.html). Here is how three of them handle your variants:

>>> from nltk import stem
>>> porter = stem.porter.PorterStemmer()
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> tokens = ['player', 'playa', 'playas', 'pleyaz']
>>> [porter.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
>>> [lancaster.stem(i) for i in tokens]
['play', 'play', 'playa', 'pleyaz']
>>> [snowball.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']

None of these maps every variant to "play". What you probably need is a regex stemmer that strips exactly the suffixes you care about:

>>> from nltk import stem
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
>>> [rxstem.stem(i) for i in tokens]
['play', 'play', 'play', 'pley']
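
To get from there to the frequency count you mention, here is a minimal sketch using only the standard library. It assumes the regex stemmer amounts to deleting the matched suffix, and it reuses the suffix pattern from above; the function name strip_suffix and the sample tokens are made up for illustration.

```python
import re
from collections import Counter

# Suffix pattern copied from the answer above. Everything else here
# (function name, sample tokens) is hypothetical, for illustration only.
SUFFIXES = re.compile(r'er$|a$|as$|az$')

def strip_suffix(word):
    """Drop a trailing 'er', 'a', 'as' or 'az', if present."""
    return SUFFIXES.sub('', word)

tokens = ['player', 'playa', 'like', 'playas', 'america', 'playaz', 'like']
stems = [strip_suffix(t) for t in tokens]

# The rule is applied to every token, so unrelated words get trimmed too
# ('america' becomes 'americ'). Looking up only the stem of interest
# sidesteps that noise.
counts = Counter(stems)
print(counts['play'])          # occurrences of the "play" stem
print(counts.most_common(3))   # most frequent stems overall
```

Because the suffix rule touches every word that happens to end in one of those suffixes, it is probably safest to read off only the stems you care about, or to restrict the pattern further. A misspelling such as 'pleyaz' still ends up as 'pley', so merging it with "play" would need an explicit spelling map on top.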


answered Sep 28 '22 00:09

alvas

