
Why are these words considered stopwords?

I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side could shed some light on this. I am playing around with the NLTK library, and I was specifically looking into the stopwords list provided by this package:

In [80]: nltk.corpus.stopwords.words('english')

Out[80]:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

What I don't understand is, why is the word "not" present? Isn't that necessary to determine the sentiment inside a sentence? For instance, a sentence like this:

I am not sure what the problem is.

is totally different once the stopword "not" is removed: the meaning of the sentence changes to its opposite ("I am sure what the problem is"). If that is the case, is there a set of rules I am missing about when not to use these stopwords?

Legend asked Jun 26 '11 03:06



1 Answer

The concept of a stop word list does not have a universal meaning; it depends on what you want to do. If your task requires understanding the polarity, sentiment, or a similar characteristic of a phrase, and your method depends on detecting negation (as in your example), you obviously shouldn't remove "not" as a stop word. You may still want to remove other very common, unrelated words, which would then constitute your new stop word list.
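To make that concrete, here is a minimal sketch of building a task-specific stop list that keeps negations. The `STOPWORDS` set is only a small hand-copied subset of the NLTK list quoted in the question, and `NEGATIONS`, `SENTIMENT_STOPWORDS`, and `remove_stopwords` are names I made up for illustration; with NLTK installed you would presumably start from `set(nltk.corpus.stopwords.words('english'))` instead.

```python
import re

# Small subset of the NLTK English list quoted in the question (illustrative only).
STOPWORDS = {"i", "am", "is", "the", "what", "a", "an", "of", "not", "no", "nor"}

# Assumption: the negation words we want to keep for sentiment-style tasks.
NEGATIONS = {"not", "no", "nor"}

# Task-specific stop list: the default list minus the negations.
SENTIMENT_STOPWORDS = STOPWORDS - NEGATIONS


def remove_stopwords(text, stopwords):
    """Lowercase, split into word tokens, and drop anything in `stopwords`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


sentence = "I am not sure what the problem is."
default_tokens = remove_stopwords(sentence, STOPWORDS)
sentiment_tokens = remove_stopwords(sentence, SENTIMENT_STOPWORDS)
```

With the default list the sentence reduces to `['sure', 'problem']`; with the adjusted list it reduces to `['not', 'sure', 'problem']`, so the negation survives.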

However, to answer your question, most sentiment analysis methods are very superficial. They look for emotion- or sentiment-laden words and, most of the time, do not attempt a deep analysis of the sentence, so removing "not" often makes little difference to them.
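A toy version of such a superficial method might look like the sketch below. The `LEXICON` and `lexicon_score` are hypothetical and far smaller than anything used in practice; the point is only that a scorer which sums word polarities never looks at "not", whether or not it is in the stop list.

```python
import re

# Hypothetical toy polarity lexicon; real lexicons contain thousands of entries.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "terrible": -1, "sad": -1}


def lexicon_score(text):
    """Sum the polarity of every token found in the lexicon; ignore all other words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(t, 0) for t in tokens)


with_not = lexicon_score("The movie was not good")
without_not = lexicon_score("The movie was good")
```

Both sentences score `1` here, because "not" is not in the lexicon and so contributes nothing either way.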

As another example where you would want to keep the stop words: if you are trying to classify documents according to their authors (authorship attribution) or carrying out stylometric analysis, you should definitely keep these function words, as they characterize a large part of the style and the discourse.
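As a rough illustration of why function words matter for style, the sketch below turns a text into a vector of relative function-word frequencies. The `FUNCTION_WORDS` list and `function_word_profile` are my own illustrative choices, not a standard feature set; real stylometric work uses much longer word lists and much longer texts.

```python
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of function words used as style features.
FUNCTION_WORDS = ["the", "of", "and", "to", "but", "not", "which"]


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


text_a = "The end of the road and the start of the sea"
text_b = "But not to worry, which is to say, not yet"
profile_a = function_word_profile(text_a)
profile_b = function_word_profile(text_b)
```

The two profiles differ sharply (the first is dominated by "the" and "of", the second by "to" and "not"), and that signal disappears entirely if the stop words are stripped first.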

However, for many other kinds of analyses (e.g. word space models, document similarity, search, etc.) removing very common function words makes sense both computationally (you process fewer words) and in some cases practically (you may even get better results with the stop words removed). If I'm trying to understand the contexts in which a specific word most often appears, I'd like to see the content words, not the function words.
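Here is a small sketch of the document similarity case, using Jaccard overlap between token sets as a stand-in for whatever similarity measure you actually use. The stop list, the helper names, and the two example documents are all made up for illustration.

```python
import re

# Illustrative stop list; in practice this would be a full list such as NLTK's.
STOPWORDS = {"the", "on", "a", "of", "in"}


def token_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "The cat sat on the mat"
doc_b = "The stock fell on the news"

raw_similarity = jaccard(token_set(doc_a), token_set(doc_b))
content_similarity = jaccard(token_set(doc_a) - STOPWORDS, token_set(doc_b) - STOPWORDS)
```

The two documents share no content words, yet the raw similarity is `0.25` purely because both contain "the" and "on". With the stop words removed it drops to `0.0`, which better reflects that the documents are unrelated.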

Ruggiero Spearman answered Oct 05 '22 18:10