I am trying to remove stopwords from a string of text:
from nltk.corpus import stopwords text = 'hello bye the the hi' text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])
I am processing 6 mil of such strings so speed is important. Profiling my code, the slowest part is the lines above, is there a better way to do this? I'm thinking of using something like regex's re.sub
but I don't know how to write the pattern for a set of words. Can someone give me a hand and I'm also happy to hear other possibly faster methods.
Note: I tried someone's suggest of wrapping stopwords.words('english')
with set()
but that made no difference.
Thank you.
Using Python's Gensim Library All you have to do is to import the remove_stopwords() method from the gensim. parsing. preprocessing module. Next, you need to pass your sentence from which you want to remove stop words, to the remove_stopwords() method which returns text string without the stop words.
In order to remove stopwords and punctuation using NLTK, we have to download all the stop words using nltk. download('stopwords'), then we have to specify the language for which we want to remove the stopwords, therefore, we use stopwords. words('english') to specify and save it to the variable.
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords cachedStopWords = stopwords.words("english") def testFuncOld(): text = 'hello bye the the hi' text = ' '.join([word for word in text.split() if word not in stopwords.words("english")]) def testFuncNew(): text = 'hello bye the the hi' text = ' '.join([word for word in text.split() if word not in cachedStopWords]) if __name__ == "__main__": for i in xrange(10000): testFuncOld() testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
nCalls Cumulative Time
10000 7.723 words.py:7(testFuncOld)
10000 0.140 words.py:11(testFuncNew)
So, caching the stopwords instance gives a ~70x speedup.
Use a regexp to remove all words which do not match:
import re pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*') text = pattern.sub('', text)
This will probably be way faster than looping yourself, especially for large input strings.
If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With