Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stopword removal with NLTK

I am trying to process a user entered text by removing stopwords using nltk toolkit, but with stopword-removal the words like 'and', 'or', 'not' gets removed. I want these words to be present after stopword removal process as they are operators which are required for later processing text as query. I don't know which are the words which can be operators in text query, and I also want to remove unnecessary words from my text.

like image 721
Grahesh Parkar Avatar asked Oct 02 '13 05:10

Grahesh Parkar


People also ask

What is Stopword in NLTK?

The stopwords in nltk are the most common words in data. They are words that you do not want to use to describe the topic of your content. They are pre-defined and cannot be removed.

What is Stopword removal?

Stop word removal is one of the most commonly used preprocessing steps across different NLP applications. The idea is simply removing the words that occur commonly across all the documents in the corpus. Typically, articles and pronouns are generally classified as stop words.

What is Stopword removal and stemming?

Stop word elimination and stemming are commonly used method in indexing. Stop words are high frequency words that have little semantic weight and are thus unlikely to help the retrieval process. Usual practice in IR is to drop them from index. Stemming conflates morphological variants of words in its root or stem.


2 Answers

There is an in-built stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al), see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize >>> from nltk.corpus import stopwords >>> stop = set(stopwords.words('english')) >>> sentence = "this is a foo bar sentence" >>> print([i for i in sentence.lower().split() if i not in stop]) ['foo', 'bar', 'sentence'] >>> [i for i in word_tokenize(sentence.lower()) if i not in stop]  ['foo', 'bar', 'sentence'] 

I recommend looking at using tf-idf to remove stopwords, see Effects of Stemming on the term frequency?

like image 125
alvas Avatar answered Oct 18 '22 03:10

alvas


I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not')) stop = set(stopwords...) - operators 

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:     # use word 
like image 23
otus Avatar answered Oct 18 '22 05:10

otus