
How to extend the stopword list from NLTK and remove stop words with the extended list?

I have tried two ways of removing stopwords, and I run into issues with both:

Method 1:

cachedStopWords = stopwords.words("english")
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)

In this case, only the first call (remove) works; the second (remove2) doesn't.

Method 2:

lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print(words)  # list of words

The output looks like this: ["Hello", "Good", "day"]

I try to remove stopwords from words. This is my code:

for word in words:
    if word in cachedStopWords:
        continue
    else:
        new_words='\n'.join(word)

print(new_words)

The output looks like this:

H
e
l
l
o

I can't figure out what is wrong with the two methods above. Please advise.

jxn asked Mar 26 '15 09:03



People also ask

How do you remove stop words with NLTK?

NLTK supports stop word removal, and you can find the list of stop words in the corpus module. To remove stop words from a sentence, split your text into words and drop each word that exists in the list of stop words provided by NLTK.

What is Stopword removal?

Stop word removal is one of the most commonly used preprocessing steps across different NLP applications. The idea is simply to remove the words that occur frequently across all documents in the corpus. Articles and pronouns are typically classified as stop words.

What is Stopword in NLTK?

Stop words are words so common that they are usually filtered out during text preprocessing. NLTK (Natural Language Toolkit) ships stop word lists in its corpus module; the English list includes words such as “a”, “an”, “the”, “of”, and “in”, and has 179 entries in the version used in the first answer below (the exact count varies by NLTK version).


2 Answers

Use this to extend the stopword list:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stop_words))

Output:

179

184
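Note that 'from' is already in NLTK's English list, so `extend` adds it a second time. A minimal sketch of one way around that is to hold the combined words in a set, which drops duplicates and makes membership tests fast. The short `base_stop_words` list here is only a stand-in for `stopwords.words('english')` so the snippet runs without the NLTK corpus, and `remove_stop_words` is a hypothetical helper name, not part of NLTK.

```python
# Stand-in for stopwords.words('english'); swap in the real NLTK list if available.
base_stop_words = ['i', 'me', 'my', 'the', 'a', 'an', 'of', 'to', 'from']

# Extra domain-specific words (the same ones used in the answer above).
extra_stop_words = ['from', 'subject', 're', 'edu', 'use']

# A set drops the duplicate 'from' and gives constant-time lookups.
stop_set = set(base_stop_words) | set(extra_stop_words)


def remove_stop_words(text, stop_set):
    """Return the whitespace-separated tokens of `text` not in `stop_set`.

    The comparison is case-insensitive; the original casing of kept tokens
    is preserved.
    """
    return [tok for tok in text.split() if tok.lower() not in stop_set]


filtered = remove_stop_words("Please use the edu mail from my account", stop_set)
print(len(stop_set))
print(filtered)
```

Whether a set or a list is the better fit depends on whether the order of the stop words matters later; for filtering alone, order is irrelevant.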

Akash Kandpal answered Nov 15 '22 03:11



I think what you want to achieve is to extend the list of stopwords from NLTK. Since the stopwords in NLTK are kept in a single list, you can simply do this:

>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> stoplist
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
>>> more_stopwords = """with some your just have from it's /via & that they your there this into providing would can't"""
>>> stoplist += more_stopwords.split()
>>> sent = "With some of hacks to your line of code , we can simply extract the data you need ."
>>> sent_with_no_stopwords = [word for word in sent.split() if word not in stoplist]
>>> sent_with_no_stopwords
['With', 'hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
# Note that the "With" is different from "with".
# So let's try this:
>>> sent_with_no_stopwords = [word for word in sent.lower().split() if word not in stoplist]
>>> sent_with_no_stopwords
['hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
# To get it back into a string:
>>> new_sent = " ".join(sent_with_no_stopwords)
>>> new_sent
'hacks line code , simply extract data need .'
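As for Method 2 in the question, the one-character-per-line output most likely comes from `'\n'.join(word)`: `join` iterates over whatever it is given, and iterating over a single string yields its characters. `new_words` is also reassigned on every pass, so only one word survives the loop. A minimal sketch of a fix, assuming a small stand-in set in place of `stopwords.words("english")` so it runs without the NLTK corpus:

```python
# Stand-in for stopwords.words("english"); use the NLTK list in practice.
cached_stop_words = {"good", "the", "a", "is"}

words = ["Hello", "Good", "day"]

# What the question's loop does: join() over one string splits it into characters.
assert '\n'.join("Hello") == "H\ne\nl\nl\no"

# Fix: collect the surviving words in a list, then join that list once.
kept = []
for word in words:
    if word.lower() in cached_stop_words:
        continue
    kept.append(word)

new_words = '\n'.join(kept)
print(new_words)
```

For Method 1, the `tu` helper is not shown, so the cause can only be guessed at: `words_to_remove` is a string while `cachedStopWords` is a list, and if `removal_set` expects a string, that mismatch could explain why the second call fails.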
alvas answered Nov 15 '22 05:11
