
adding words to stop_words list in TfidfVectorizer in sklearn


I want to add a few more words to the stop_words list in TfidfVectorizer. I followed the solution in Adding words to scikit-learn's CountVectorizer's stop list. My stop word list now contains both the 'english' stop words and the words I specified, but TfidfVectorizer still does not accept my list, and I can still see those words in my feature list. Below is my code:

from sklearn.feature_extraction import text

my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)
vectorizer = TfidfVectorizer(analyzer=u'word', max_df=0.95, lowercase=True,
                             stop_words=set(my_stop_words), max_features=15000)
X = vectorizer.fit_transform(text)

I have also tried setting stop_words in TfidfVectorizer as stop_words=my_stop_words, but it still does not work. Please help.

ac11 asked Nov 09 '14

People also ask

Does TfidfVectorizer remove stop words?

From the way the Tf-idf score is set up, there shouldn't be any significant difference from removing the stopwords. The whole point of the IDF is exactly to downweight words with no semantic value across the corpus. Even if you leave the stopwords in, the IDF should largely suppress them.

What is the difference between TfidfVectorizer and TfidfTransformer?

The main difference between the two implementations is that TfidfVectorizer performs both term frequency and inverse document frequency weighting for you, while using TfidfTransformer requires you to first use scikit-learn's CountVectorizer class to compute the term frequencies.
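The one-step and two-step routes described above produce identical matrices with default settings, which a short sketch can confirm (the sample documents are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["this is an apple", "this is a book"]

# One step: TfidfVectorizer counts terms and applies IDF weighting together.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: CountVectorizer for raw counts, then TfidfTransformer for weighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# With matching defaults (norm='l2', smooth_idf=True), the results agree.
same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
print(same)
```

The two-step form is mainly useful when you already have a count matrix, or want to reuse the counts for something besides Tf-idf.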

What is the difference between CountVectorizer and TfidfVectorizer?

CountVectorizer only produces raw term counts. With TfidfTransformer you systematically compute word counts using CountVectorizer first, then compute the Inverse Document Frequency (IDF) values, and only then compute the Tf-idf scores. With TfidfVectorizer, by contrast, you do all three steps at once.

How does TF-IDF Vectorizer work?

TFIDF works by proportionally increasing the number of times a word appears in the document but is counterbalanced by the number of documents in which it is present. Hence, words like 'this', 'are' etc., that are commonly present in all the documents are not given a very high rank.
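The counterbalancing described above is easy to observe: within one document, a word that occurs in every document of the corpus ends up with a lower Tf-idf weight than an equally frequent word unique to that document. A small sketch with illustrative documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the bird flew"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_

# In document 0, "the" and "cat" both have term frequency 1, but "the"
# appears in all three documents while "cat" appears only here, so the
# IDF factor gives "cat" the larger weight.
print(X[0, vocab["the"]], X[0, vocab["cat"]])
```

This is exactly why ubiquitous words like 'this' and 'are' never rank highly in a Tf-idf representation.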


1 Answer

This is how you can do it:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])

vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
X = vectorizer.fit_transform(["This is a green apple.",
                              "This is a machine learning book."])
idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

As we can see, the word book is also removed from the list of features because we listed it as a stop word. TfidfVectorizer therefore did accept the manually added stop word and ignored it when creating the vectors.
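One way to verify this behaviour in your own code, rather than eyeballing the printed vocabulary, is to assert that the custom stop word is absent from the fitted features. A minimal sketch (note that `stop_words` also accepts a plain list, which sidesteps any doubt about passing a frozenset):

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Converting the union to a list is a safe form for the stop_words parameter.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

vectorizer = TfidfVectorizer(stop_words=my_stop_words)
vectorizer.fit(["This is a green apple.", "This is a machine learning book."])

# The custom stop word should be gone; the content words should remain.
print("book" in vectorizer.vocabulary_)
print(sorted(vectorizer.vocabulary_))
```

If a word still shows up in the features after this, check that it survives lowercasing and tokenization in the same form you added it, since stop word matching happens after both.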

Pedram answered Sep 18 '22