Scikit-learn's CountVectorizer class lets you pass a string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me how to do this?
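from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Start from NLTK's English stop words and append custom ones
stop = list(stopwords.words('english'))
stop.extend('myword1 myword2 myword3'.split())
# Pass a list: recent scikit-learn releases reject a set for stop_words
vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop)
vectors = vectorizer.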
From the way the TF-IDF score is set up, removing the stop words shouldn't make a significant difference. The whole point of the IDF term is to down-weight words that carry little semantic value because they appear in most documents of the corpus. If you leave the stop words in, the IDF should largely take care of them anyway.
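To make that claim concrete, here is a minimal sketch in plain Python. It assumes the smoothed IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the toy corpus and the helper name smooth_idf are made up for illustration.

```python
import math

# Toy corpus (hypothetical); "the" appears in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)


def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


vocabulary = sorted({t for tokens in tokenized for t in tokens})
idf = {term: smooth_idf(term) for term in vocabulary}

# Print terms from lowest to highest IDF weight.
for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Note that with this formula a word found in every document gets the smallest weight, but that weight is 1, not 0. The IDF down-weights such words relative to rarer ones without removing them, so an explicit stop word list can still change the result.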
CountVectorizer creates a matrix of documents and token counts (a bag of terms/tokens); the result is therefore also known as a document-term matrix (DTM).
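For intuition, a document-term matrix can be built by hand. This is only a sketch of what the matrix contains, not how CountVectorizer is implemented (its tokenization differs and it returns a sparse matrix); the two example documents are made up.

```python
from collections import Counter

# Hypothetical two-document corpus.
docs = ["the cat sat", "the cat saw the dog"]
tokenized = [d.lower().split() for d in docs]

# Columns: the sorted vocabulary across all documents.
vocabulary = sorted({t for tokens in tokenized for t in tokens})

# Rows: one per document, holding the count of each vocabulary term.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each row is a document and each column a term, so the entry at (row, column) is how often that term occurs in that document.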
According to the source code for sklearn.feature_extraction.text, the full list (actually a frozenset, from stop_words) of ENGLISH_STOP_WORDS is exposed through __all__. Therefore if you want to use that list plus some more items, you could do something like:

from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
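An end-to-end version of this might look like the sketch below. The extra stop words and the two-sentence corpus are placeholders. One assumption worth flagging: recent scikit-learn releases validate the stop_words argument and accept only 'english', a list, or None, so the union is converted to a list here, which also works on older versions.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder additions to the built-in English list.
my_additional_stop_words = ["myword1", "myword2"]

# union() returns a frozenset; convert to a list so that newer
# scikit-learn versions accept it as the stop_words argument.
stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_words))

# Placeholder corpus.
corpus = [
    "the quick brown fox myword1",
    "a lazy dog and myword2",
]

vectorizer = CountVectorizer(stop_words=stop_words)
dtm = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# vocabulary_ maps each kept term to its column index.
print(sorted(vectorizer.vocabulary_))
print(dtm.toarray())
```

Both the built-in stop words (such as "the" and "and") and the custom ones should be absent from the printed vocabulary.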