I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:
import nltk
from nltk import stem

f = open('tupac_original.txt', 'r')   # 'rU' mode is deprecated; plain 'r' works
text = f.read()
f.close()
tokens = text.split()
tup = nltk.Text(tokens)
lowtup = [w.lower() for w in tup if w.isalpha()]      # lowercase, letters only
stopwords = set(nltk.corpus.stopwords.words('english'))
tupclean = [w for w in lowtup if w not in stopwords]  # drop English stopwords
tupstem = stem.RegexpStemmer('az$|as$|a$')
[tupstem.stem(i) for i in tupclean]
The result of the above is:
['like', 'ed', 'young', 'black', 'like'...]
I'm trying to clean up .txt files (lowercase everything, remove stopwords, etc.), normalize multiple spellings of a word into one, and do a frequency distribution/count. I know how to do FreqDist, but any suggestions as to where I'm going wrong with the stemming?
nltk.stem is a package that performs stemming using different classes.
Stemming is the process of reducing a word's morphological variants to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers.
Stemming is faster than lemmatization because it simply chops off word endings irrespective of context, whereas lemmatization is context-dependent. Stemming is a rule-based approach; lemmatization is a dictionary-based approach and is generally more accurate. Both techniques are used by search engines and chatbots to normalize words: stemming works on the surface form of the word alone, while lemmatization takes into account the context in which the word is used.
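To see the difference in practice, here is a minimal sketch comparing NLTK's PorterStemmer with the WordNetLemmatizer (this assumes the WordNet data has been fetched with nltk.download('wordnet')):

>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> porter = PorterStemmer()
>>> wnl = WordNetLemmatizer()
>>> porter.stem('studies')            # rule-based: blindly chops the suffix
'studi'
>>> wnl.lemmatize('studies')          # dictionary-based: returns a real word
'study'
>>> wnl.lemmatize('better', pos='a')  # uses part-of-speech context
'good'

Note how the stemmer can produce non-words ('studi'), while the lemmatizer maps 'better' to its dictionary lemma 'good' once it knows the word is an adjective.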
There are several well-known, ready-made stemmers in NLTK (see http://nltk.org/api/nltk.stem.html); the example below compares a few of them:
>>> from nltk import stem
>>> porter = stem.porter.PorterStemmer()
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> tokens = ['player', 'playa', 'playas', 'pleyaz']
>>> [porter.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
>>> [lancaster.stem(i) for i in tokens]
['play', 'play', 'playa', 'pleyaz']
>>> [snowball.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
But what you probably need is some sort of regex stemmer:
>>> from nltk import stem
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
>>> [rxstem.stem(i) for i in tokens]
['play', 'play', 'play', 'pley']
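To tie this back to the question, you could apply the regex stemmer to the cleaned token list and then count with FreqDist. A rough sketch (the startswith filter is just an illustrative heuristic I'm adding so unrelated words aren't mangled; note that 'pleyaz' still stems to 'pley', so that spelling would need an extra mapping):

>>> from nltk import FreqDist
>>> # stem only the tokens that look like spellings of "play",
>>> # leaving the rest of the vocabulary untouched
>>> normalized = [rxstem.stem(w) if w.startswith(('play', 'pley')) else w
...               for w in tupclean]
>>> fd = FreqDist(normalized)
>>> fd['play']    # combined count of play/player/playa/playas/...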