I have tried two ways of removing stopwords, and I run into issues with both:
Method 1:
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)
In this case, only the first removal set (remove) works; remove2 doesn't.
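(Note: tu.removal_set is a custom helper, so this is only a guess, but the first call passes a whitespace-separated string while the second passes a Python list; joining the list first would give both calls the same input shape:)

# Hypothetical fix, assuming removal_set expects a single string of words
# like words_to_remove above, rather than a list.
remove2 = tu.removal_set(" ".join(cachedStopWords), query)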
Method 2:
lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print words # list of words
The output looks like this: ["Hello", "Good", "day"]
I then try to remove the stopwords from words. This is my code:
for word in words:
    if word in cachedStopWords:
        continue
    else:
        new_words = '\n'.join(word)
        print new_words
The output looks like this:
H
e
l
l
o
I can't figure out what is wrong with either of these two methods. Please advise.
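For what it's worth, the character-per-line output comes from '\n'.join(word): join() expects an iterable of strings, and iterating over a single word yields its individual characters. A minimal sketch of the intended filtering (assuming words is the token list built above):

# join() expects an iterable of strings; iterating "Hello" yields
# 'H', 'e', 'l', 'l', 'o', which is why each character prints on its own line.
new_words = [word for word in words if word not in cachedStopWords]
print('\n'.join(new_words))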
NLTK supports stop word removal, and you can find the list of stop words in its corpus module. To remove stop words from a sentence, split the text into words and drop each word that exists in the list of stop words provided by NLTK.
Stop word removal is one of the most commonly used preprocessing steps across NLP applications. The idea is simply to remove the words that occur commonly across all the documents in the corpus; articles and pronouns are typically classified as stop words.
Stop words are words so common that they are usually filtered out before analysis. By default, NLTK's (Natural Language Toolkit) English list contains 179 stop words, including "a", "an", "the", "of", "in", etc. They are among the most frequent words in English text.
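Here is a minimal, self-contained sketch of that approach (the sample sentence is just for illustration; you may need to run nltk.download('stopwords') and nltk.download('punkt') once beforehand):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
sentence = "This is a sample sentence, showing off stop word filtration."
# Tokenize, then keep only tokens that do not appear in the stop word list.
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
print(filtered)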
Use this to extend the stopword list:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stop_words))
Output:
179
184
I think what you want to achieve is to extend the list of stopwords from NLTK. Since the stopwords in NLTK are kept in a single list, you can simply do this:
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> stoplist
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
>>> more_stopwords = """with some your just have from it's /via & that they your there this into providing would can't"""
>>> stoplist += more_stopwords.split()
>>> sent = "With some of hacks to your line of code , we can simply extract the data you need ."
>>> sent_with_no_stopwords = [word for word in sent.split() if word not in stoplist]
>>> sent_with_no_stopwords
['With', 'hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
# Note that the "With" is different from "with".
# So let's try this:
>>> sent_with_no_stopwords = [word for word in sent.lower().split() if word not in stoplist]
>>> sent_with_no_stopwords
['hacks', 'line', 'code', ',', 'simply', 'extract', 'data', 'need', '.']
# To get it back into a string:
>>> new_sent = " ".join(sent_with_no_stopwords)
>>> new_sent
'hacks line code , simply extract data need .'
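One small refinement (a suggestion beyond the original answer): membership tests on a list scan it linearly, so for longer texts it is faster to convert the stoplist to a set before filtering:

>>> stopset = set(stoplist)
>>> sent_with_no_stopwords = [word for word in sent.lower().split() if word not in stopset]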