How to extend the stopword list from NLTK and remove stop words with the extended list?

I have tried two ways of removing stopwords, and I run into issues with both:

Method 1:

cachedStopWords = stopwords.words("english")
words_to_remove = """with some your just have from it's /via &amp; that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)

In this case, only the first call (remove) works; the second one (remove2) doesn't.

Method 2:

lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print words # list of words

The output looks like this: ["Hello", "Good", "day"]

I then try to remove the stopwords from words. This is my code:

for word in words:
    if word in cachedStopwords:
        continue
    else:
        new_words='\n'.join(word)

print new_words

The output looks like this:

H
e
l
l
o

I can't figure out what is wrong with the two methods above. Please advise.

How do you remove stop words with NLTK?

NLTK supports stop word removal, and you can find the lists of stop words in its corpus module. To remove stop words from a sentence, split the text into words and drop each word that exists in the stop word list provided by NLTK.
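
As a minimal sketch of that split-and-filter step: the small stop_words set below is a hand-written stand-in so the example runs without the NLTK corpus (with NLTK installed you would build it from stopwords.words("english") instead), and remove_stop_words is a hypothetical helper name, not an NLTK function.

```python
# Stand-in for stopwords.words("english"); assumed here so the sketch
# runs without downloading the NLTK stopwords corpus.
stop_words = {"a", "an", "the", "of", "in", "is", "on"}


def remove_stop_words(text, stop_words):
    """Split text on whitespace and drop words found in stop_words."""
    # Compare in lowercase, since the stop word entries are lowercase.
    return [word for word in text.split() if word.lower() not in stop_words]


filtered = remove_stop_words("The cat is on the mat", stop_words)
print(filtered)  # ['cat', 'mat']
```

Splitting on whitespace leaves punctuation attached to words, so a proper tokenizer may be a better fit for real text.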

What is Stopword removal?

Stop word removal is one of the most commonly used preprocessing steps across NLP applications. The idea is to remove words that occur frequently in every document of the corpus and therefore carry little distinguishing information. Articles and pronouns are typical examples of stop words.

What is Stopword in NLTK?

Stop words are words so common that text-processing pipelines usually filter them out. NLTK (Natural Language Toolkit) ships stop word lists for several languages; the English list includes “a”, “an”, “the”, “of”, “in”, etc. It has 179 entries in the version used in the example below, and the exact count varies between NLTK releases.

Use this to extend the stopword list:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stop_words))

Output:

179

184
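
Note that at least 'from' is already in the English list, so extend() adds it a second time; the length still grows by five. If duplicates or lookup speed matter, one option is to keep the stop words in a set. A small sketch, where base_stop_words is a short stand-in for stopwords.words('english') (an assumption made so the example runs without the NLTK corpus):

```python
# Stand-in for stopwords.words("english"), assumed so the sketch runs
# without the NLTK corpus.
base_stop_words = ["i", "me", "my", "from", "the", "and"]
extra_stop_words = ["from", "subject", "re", "edu", "use"]

# Concatenating lists (like list.extend) keeps duplicates:
# "from" now appears twice.
as_list = base_stop_words + extra_stop_words

# A set union stores each word once and gives fast membership tests.
as_set = set(base_stop_words) | set(extra_stop_words)

print(len(as_list))  # 11
print(len(as_set))   # 10
```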

I think what you want to achieve is to extend the list of stopwords from NLTK. Since the stopwords in NLTK are kept in a single list, you can simply do this:

>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> stoplist
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
>>> more_stopwords = """with some your just have from it's /via &amp; that they your there this into providing would can't"""
>>> stoplist += more_stopwords.split()
>>> sent = "With some of hacks to your line of code , we can simply extract the data you need ."
>>> sent_with_no_stopwords = [word for word in sent.split() if word not in stoplist]
>>> sent_with_no_stopwords
['With', 'hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
# Note that the "With" is different from "with".
# So let's try this:
>>> sent_with_no_stopwords = [word for word in sent.lower().split() if word not in stoplist]
>>> sent_with_no_stopwords
['hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
# To get it back into a string:
>>> new_sent = " ".join(sent_with_no_stopwords)
>>> new_sent
'hacks line code , simply extract data need .'
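
As for the loop in the question: the single characters appear because '\n'.join(word) is applied to one string, so it joins that word's characters, and new_words is overwritten on every pass. The loop also refers to cachedStopwords while the list was defined as cachedStopWords. A sketch of a corrected version follows; cached_stop_words is a small stand-in set (it contains "good" only so that something gets removed), not the NLTK list.

```python
# Stand-in for stopwords.words("english"), assumed so this runs without
# the NLTK corpus; "good" is included only for illustration.
cached_stop_words = {"good", "a", "the"}
words = ["Hello", "Good", "day"]

# Collect the surviving words in a list, then join the list once at the end.
kept = [word for word in words if word.lower() not in cached_stop_words]
new_words = "\n".join(kept)
print(new_words)
```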

Tags:

python

nlp

nltk

stop-words

jxn

2 Answers

Akash Kandpal

alvas
