 

Faster way to remove stop words in Python

I am trying to remove stopwords from a string of text:

    from nltk.corpus import stopwords

    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split()
                     if word not in stopwords.words('english')])

I am processing 6 million such strings, so speed is important. Profiling my code shows that the lines above are the slowest part. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.
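For what it's worth, wrapping in set() only pays off if the set is built once, outside the per-string work; a minimal sketch of that (the variable name stop_set is my own):

    from nltk.corpus import stopwords

    # Build the set a single time; rebuilding it for every string negates the benefit.
    stop_set = set(stopwords.words('english'))

    text = 'hello bye the the hi'
    text = ' '.join(word for word in text.split() if word not in stop_set)
    # -> 'hello bye hi'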

Thank you.

asked Oct 24 '13 by mchangun


2 Answers

Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    # Build the stop-word list once, at module level, instead of on every call.
    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split()
                         if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split()
                         if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    ncalls  cumtime  function
    10000   7.723    words.py:7(testFuncOld)
    10000   0.140    words.py:11(testFuncNew)

So, caching the stopwords instance gives a ~55x speedup (7.723 s vs. 0.140 s cumulative).
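On top of the caching, converting the cached list to a set should help further, since membership tests on a list are O(n) per word while a set is O(1). A sketch of that variant, not benchmarked here:

    from nltk.corpus import stopwords

    # Cache as a set for O(1) membership tests instead of scanning a list.
    cachedStopWords = set(stopwords.words("english"))

    def testFuncSet():
        text = 'hello bye the the hi'
        return ' '.join(word for word in text.split()
                        if word not in cachedStopWords)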

answered Oct 08 '22 by Andy Rimmer

Use a regexp to remove all words that match the stop-word pattern:

    import re
    from nltk.corpus import stopwords

    pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
    text = pattern.sub('', text)

This will probably be way faster than looping yourself, especially for large input strings.

If the last word in the text is a stop word, deleting it leaves trailing whitespace; I propose handling that separately (see the sketch below).
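A fuller sketch of this regex approach, with two precautions that are my own additions rather than part of the original answer: re.escape() guards against stop words containing regex metacharacters, and strip() handles the trailing-whitespace case just mentioned:

    import re
    from nltk.corpus import stopwords

    # Escape each word so any regex metacharacters in the list are treated literally.
    stop_pattern = re.compile(
        r'\b(' + '|'.join(map(re.escape, stopwords.words('english'))) + r')\b\s*')

    def remove_stopwords(text):
        # Delete matched stop words, then trim any whitespace left at the ends.
        return stop_pattern.sub('', text).strip()

    print(remove_stopwords('hello bye the the hi'))  # -> 'hello bye hi'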

answered Oct 08 '22 by Alfe