
Python natural language processing stop words [duplicate]

I am just doing some research into NLP with Python and I have identified something strange.

On review of the following negative tweets:

neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),  # <---
              ('He is my enemy', 'negative')]

And after some processing to remove stop words:

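from nltk.corpus import stopwords

clean_data = []
stop_words = set(stopwords.words("english"))

# pos_tweets is defined elsewhere (not shown in the question)
for (words, sentiment) in pos_tweets + neg_tweets:
    # The stop-word check runs before lowercasing, so 'I' survives as 'i'
    words_filtered = [e.lower() for e in words.split() if e not in stop_words]
    clean_data.append((words_filtered, sentiment))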

Part of the output is:

 (['i', 'looking', 'forward', 'concert'], 'negative')

I'm struggling to understand why the stop words include 'not', which can affect the sentiment of a tweet.

My understanding is that stop words have no value in terms of sentiment.

So my question is: why is 'not' included in the stop words list?

Andrew Daly asked Jun 27 '17 15:06



1 Answer

Stop words in a sentence are "generally" of little or no use. As the Stanford NLP group puts it:

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words

Why the word "not"? Simply because it appears very often in English and is "usually" of little or no importance. For example, in text summarization these stop words are of little to no use, and the result is determined by the frequency distribution of words (as in tf-idf).

So what can you do? This is a broad topic known as negation handling, and there are many different methods. One of my favorites is to simply join each negation word to the word that precedes or follows it, before removing the stop words or calculating word vectors. For example, you can convert not looking to not_looking, which becomes a quite different token once converted to vector space. You can find code for doing something similar in an SO answer here.
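As a rough illustration of that idea, here is a minimal sketch that joins a negation word to the word following it before the stop words are filtered out. The names mark_negations, clean, NEGATIONS and the tiny STOP_WORDS set are made up for this example so that it runs without NLTK data; in the question's code the stop words would come from stopwords.words("english").

```python
# Hypothetical, deliberately tiny stop-word list for illustration only;
# the question uses set(stopwords.words("english")) from NLTK instead.
STOP_WORDS = {"i", "am", "do", "to", "the", "this", "is", "my", "he", "not"}

# Assumed set of negation cues; extend it to suit your data.
NEGATIONS = {"not", "no", "never"}


def mark_negations(tokens):
    """Join each negation cue to the token after it (not looking -> not_looking)."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def clean(text):
    """Lowercase, merge negations, then drop stop words."""
    tokens = text.lower().split()
    tokens = mark_negations(tokens)
    return [t for t in tokens if t not in STOP_WORDS]


print(clean("I am not looking forward to the concert"))
# ['not_looking', 'forward', 'concert']
```

Because the merge happens before filtering, the negation survives as part of a new token even though 'not' itself is still in the stop-word set.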

I hope this helps!

Rudresh Panchal answered Sep 19 '22 14:09
