Faster way to remove stop words in Python

Tags: python, regex, stop-words

I am trying to remove stopwords from a string of text:

    from nltk.corpus import stopwords

    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.

526

asked Oct 24 '13 08:10

mchangun

2 Answers

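Try caching the stopwords object, as shown below. In your code, stopwords.words('english') sits in the condition of the list comprehension, so it is called again for every word. Constructing the list each time seems to be the bottleneck.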

    from nltk.corpus import stopwords

    # Build the stop word list once, at module level
    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):  # xrange in the original; range also works on Python 3
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    nCalls  Cumulative Time  Function
    10000   7.723            words.py:7(testFuncOld)
    10000   0.140            words.py:11(testFuncNew)

So, caching the stopwords instance gives a ~55x speedup on these numbers (7.723 s vs. 0.140 s cumulative).
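Caching also explains why set() may have appeared to make no difference: if the set is built inside the comprehension, it is rebuilt for every word. A minimal sketch of combining the two ideas follows. The small STOP_WORDS set and the function name remove_stop_words are stand-ins of mine so the snippet runs without NLTK; with NLTK installed you would presumably build the set once from stopwords.words('english') instead.

```python
# Stand-in stop word set (an assumption, so the example runs without NLTK).
# With NLTK you would build it once, e.g.:
#   STOP_WORDS = frozenset(stopwords.words('english'))
STOP_WORDS = frozenset({'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is'})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that is in stop_words."""
    # Membership tests on a set are hash lookups, not list scans.
    return ' '.join(word for word in text.split() if word not in stop_words)


print(remove_stop_words('hello bye the the hi'))  # hello bye hi
```

The key point is that the set is created once, outside the function, and only looked up inside the loop.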

163

answered Oct 08 '22 10:10

Andy Rimmer

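Use a regexp to remove all words which match a stop word:

    import re
    from nltk.corpus import stopwords

    # Compile one alternation of all stop words, once, then reuse it
    pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
    text = pattern.sub('', text)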

This will probably be way faster than looping yourself, especially for large input strings.

If this deletes the last word in the text, you may be left with trailing whitespace. I suggest handling that separately.
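One possible way to put the regexp approach together, including the trailing-whitespace cleanup, is sketched below. The short STOP_WORDS list and the name remove_stop_words_re are assumptions of mine so that the snippet runs without NLTK; re.escape is added as a precaution in case a stop word ever contains a regex metacharacter.

```python
import re

# Stand-in stop word list (an assumption, so the example runs without NLTK).
STOP_WORDS = ['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is']

# \b on both sides restricts matches to whole words; the trailing \s*
# also swallows the whitespace that follows a removed word.
PATTERN = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, STOP_WORDS)) + r')\b\s*'
)


def remove_stop_words_re(text):
    """Delete whole-word stop word matches, then trim leftover whitespace."""
    return PATTERN.sub('', text).strip()


print(remove_stop_words_re('hello bye the the hi'))  # hello bye hi
```

Note that the match is case-sensitive as written, so 'The' would survive; passing re.IGNORECASE to re.compile is one way to change that.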

answered Oct 08 '22 09:10

Alfe

