
Adding words to scikit-learn's CountVectorizer's stop list

Scikit-learn's CountVectorizer class lets you pass a string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me how to do this?

statsNoob asked Jun 24 '14 12:06



People also ask

How do you add stop words to TfidfVectorizer?

    corpus import stopwords
    stop = list(stopwords.words('english'))
    stop.extend('myword1 myword2 myword3'.split())
    vectorizer = TfidfVectorizer(analyzer='word', stop_words=set(stop))
    vectors = vectorizer.

Does TfidfVectorizer remove stop words?

From the way the TF-IDF score is defined, removing stop words should not make a significant difference. The point of the IDF term is to down-weight words that appear in most documents and so carry little semantic value. If you leave the stop words in, the IDF should reduce their influence rather than remove them outright.

Is CountVectorizer same as bag of words?

CountVectorizer creates a matrix of token counts per document (a bag of terms/tokens), which is why its output is also known as a document-term matrix (DTM).
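To make the document-term matrix idea concrete, here is a minimal sketch. The two toy documents are made up for illustration; each row of the result is a document and each column is a term from the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = ["the cat sat", "the cat saw the dog"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Terms ordered by their column index in the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = dtm.toarray()

print(terms)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(counts)  # row 0: [1 0 1 0 1], row 1: [1 1 0 1 2]
```

Note that "the" is counted twice in the second document, which is exactly the kind of term a stop list is meant to drop.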


1 Answer

According to the source code for sklearn.feature_extraction.text, the full list (actually a frozenset, from stop_words) of ENGLISH_STOP_WORDS is exposed through __all__. Therefore if you want to use that list plus some more items, you could do something like:

    from sklearn.feature_extraction import text

    stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
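For completeness, here is a rough end-to-end sketch of the approach. The extra words ("foo", "bar") and the sample documents are made up for illustration. One assumption worth flagging: as far as I know, newer scikit-learn releases validate that stop_words is either 'english', a list, or None, so the sketch converts the frozenset to a list before passing it in. On older versions the frozenset itself should work as described above.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, chosen only for illustration.
my_additional_stop_words = ["foo", "bar"]

# union() returns a new frozenset; the built-in set is left untouched.
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Assumption: recent scikit-learn versions expect a list here,
# so convert the frozenset before handing it to the vectorizer.
vectorizer = CountVectorizer(stop_words=list(stop_words))

docs = ["the foo jumped past the bar", "a quick brown fox"]
matrix = vectorizer.fit_transform(docs)

# The learned vocabulary should exclude both built-in and custom stop words.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Because union() builds a new set, text.ENGLISH_STOP_WORDS is not modified, so other vectorizers that use stop_words='english' are unaffected.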

jonrsharpe answered Sep 22 '22 17:09
