Why are these words considered stopwords?

I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side could shed some light on this. I am playing around with the NLTK library, and I was looking specifically into the stopwords list provided by this package:

In [80]: nltk.corpus.stopwords.words('english')

Out[80]:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

What I don't understand is why the word "not" is present. Isn't it necessary for determining the sentiment of a sentence? For instance, a sentence like this:

I am not sure what the problem is.

changes completely once the stopword "not" is removed: its meaning flips to the opposite ("I am sure what the problem is"). If that is the case, is there a set of rules I am missing about when not to use these stopwords?

asked Jun 26 '11 03:06

Legend

1 Answer

The concept of a stop word list has no universal definition; it depends on what you want to do. If your task requires understanding the polarity, sentiment, or a similar characteristic of a phrase, and your method depends on detecting negation (as in your example), then you obviously shouldn't remove "not" as a stop word. You may still want to remove other very common, unrelated words, which would then constitute your new stop word list.
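A minimal sketch of a task-specific stop word list that keeps negations. To stay runnable without downloading the NLTK corpus, it hard-codes a small excerpt of the list shown in the question; the NEGATIONS set and the remove_stopwords helper are illustrative names, not part of NLTK.

```python
# Small excerpt of the English list from the question, hard-coded so the
# example runs without the NLTK corpus download.
STOPWORDS = {"i", "am", "is", "the", "what", "not", "no", "nor", "a", "of"}

# Hypothetical set of negation words to protect for sentiment-style tasks.
NEGATIONS = {"not", "no", "nor"}


def remove_stopwords(tokens, stopwords):
    """Return the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "I am not sure what the problem is".split()

# Default list: the negation disappears along with the other stop words.
default_result = remove_stopwords(tokens, STOPWORDS)

# Custom list: same stop words minus the negations we want to keep.
custom_result = remove_stopwords(tokens, STOPWORDS - NEGATIONS)

print(default_result)  # ['sure', 'problem']
print(custom_result)   # ['not', 'sure', 'problem']
```

With the full NLTK list, the same set difference could be applied to set(nltk.corpus.stopwords.words('english')).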

That said, to answer your question: most sentiment analysis methods are quite superficial. They look for emotion- or sentiment-laden words and, most of the time, do not attempt a deep analysis of the sentence, so for them removing "not" makes little difference.
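To illustrate what "superficial" means here, the following is a toy word-counting scorer. The lexicon and the lexicon_score function are made up for this sketch and do not correspond to any particular library. Because the scorer only looks up individual words, it returns the same score whether or not "not" has been filtered out.

```python
# Hypothetical toy polarity lexicon; real lexicons are far larger.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "awful": -1, "sad": -1}


def lexicon_score(tokens):
    """Sum the polarity of each token; words outside the lexicon count as 0."""
    return sum(LEXICON.get(t.lower(), 0) for t in tokens)


full_sentence = "this film is not good".split()
after_stopword_removal = ["film", "good"]  # "this", "is", "not" filtered out

score_full = lexicon_score(full_sentence)
score_filtered = lexicon_score(after_stopword_removal)

print(score_full)      # 1
print(score_filtered)  # 1
```

Both scores are positive even though the sentence is negative, which is exactly the limitation of methods that ignore negation.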

As another example where you would want to keep the stop words: if you are trying to classify documents by author (authorship attribution) or carrying out stylometric analysis, you should definitely keep these function words, since they characterize a large part of the style and the discourse.
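As a rough sketch of why function words matter for stylometry, one common kind of feature is the relative frequency of each function word in a text. The word list and the function_word_profile helper below are assumptions chosen for illustration; a real study would use a much longer list and proper tokenization.

```python
from collections import Counter

# Hypothetical short list of function words used as style features.
FUNCTION_WORDS = ["the", "of", "and", "to", "but", "not"]


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * len(FUNCTION_WORDS)
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in FUNCTION_WORDS]


profile = function_word_profile("the cat and the dog")
print(dict(zip(FUNCTION_WORDS, profile)))
```

Removing stop words before this step would zero out every feature, leaving nothing to tell authors apart.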

However, for many other kinds of analysis (e.g. word space models, document similarity, search) removing very common function words makes sense both computationally (you process fewer words) and, in some cases, practically (you may even get better results with the stop words removed). If I'm trying to understand the contexts in which a specific word is frequently used, I'd like to see the content words, not the function words.
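A small sketch of the document-similarity case, using Jaccard overlap of word sets as a stand-in for whatever similarity measure you actually use. The two sentences, the three-word stop list, and the helper names are all invented for the example.

```python
# Tiny illustrative stop list (a subset of the list in the question).
STOPWORDS = {"the", "is", "on"}


def jaccard(a, b):
    """Jaccard similarity of two token collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def content_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]


doc_a = "the cat is on the mat".split()
doc_b = "the stock is on the rise".split()

raw_similarity = jaccard(doc_a, doc_b)
filtered_similarity = jaccard(content_words(doc_a), content_words(doc_b))

print(round(raw_similarity, 3))  # 0.429
print(filtered_similarity)       # 0.0
```

The two documents are about unrelated topics, yet they look fairly similar until the shared function words are removed.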

answered Oct 05 '22 18:10

Ruggiero Spearman

Tags:

language-agnostic

machine-learning

nlp

nltk

stop-words
