Stopword removal with NLTK

I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but stopword removal also removes words like 'and', 'or', and 'not'. I want these words to remain after stopword removal because they are operators that are required for later processing of the text as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.

721

asked Oct 02 '13 05:10

Grahesh Parkar

2 Answers

NLTK has a built-in stopword list. The NLTK book describes it as 2,400 stopwords for 11 languages (Porter et al.), see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop]
['foo', 'bar', 'sentence']

I also recommend looking into tf-idf as a way to filter out stopwords; see the question "Effects of Stemming on the term frequency?"
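As a rough illustration of the tf-idf idea, the sketch below scores each word by its inverse document frequency and treats words that occur in nearly every document as stopword candidates. The toy corpus, the threshold of 0.2, and the helper name low_idf_words are all made up for this example; on a real collection you would tune the threshold and probably use a library implementation.

```python
import math
from collections import Counter

def low_idf_words(documents, threshold=0.2):
    """Return words whose inverse document frequency is at or below threshold.

    Hypothetical helper for illustration: idf(w) = log(N / df(w)), where N is
    the number of documents and df(w) is how many documents contain w.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per document.
        doc_freq.update(set(doc.lower().split()))
    return {w for w, df in doc_freq.items() if math.log(n_docs / df) <= threshold}

# Toy corpus (made up for the example).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the fish swam in the pond",
]
candidates = low_idf_words(docs)
print(candidates)  # {'the'}
```

Note that a purely frequency-based filter would still drop common operator words such as 'and' if they appear in most documents, so an explicit exception set is still needed for those.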

125

answered Oct 18 '22 03:10

alvas

I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word
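Putting the pieces together, a minimal sketch might look like the following. The small base_stopwords set is only a stand-in so the example runs without NLTK data; in practice it would be set(stopwords.words('english')) as in the other answer. The function name remove_stopwords and the sample query are invented for illustration.

```python
# Stand-in for set(stopwords.words('english')); assumed here so the sketch
# runs without downloading the NLTK stopwords corpus.
base_stopwords = {"i", "a", "the", "is", "for", "and", "or", "not", "to", "with"}

# Words that act as query operators and must survive filtering.
operators = {"and", "or", "not"}

# Set difference takes the operators out of the stopword list.
stop = base_stopwords - operators

def remove_stopwords(text, stop_set):
    """Return the words of text that are not in stop_set (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop_set]

query = "the cats AND dogs NOT a bird"
filtered = remove_stopwords(query, stop)
print(filtered)  # ['cats', 'AND', 'dogs', 'NOT', 'bird']
```

Because the operators live in their own set, adding one later (or swapping in a different stopword list) only means changing one of the two sets.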

answered Oct 18 '22 05:10

otus

Tags: python, nlp, nltk, stop-words
