I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both the 'english' stop words and the stop words I specified. But TfidfVectorizer still does not seem to accept my list of stop words, and I can still see those words in my feature list. Below is my code:
    from sklearn.feature_extraction import text

    my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)
    vectorizer = TfidfVectorizer(analyzer=u'word', max_df=0.95, lowercase=True,
                                 stop_words=set(my_stop_words), max_features=15000)
    X = vectorizer.fit_transform(text)
I have also tried passing the list directly as stop_words=my_stop_words, but it still does not work. Please help.
Given how the TF-IDF score is defined, removing the stop words should not make a significant difference. The whole point of the IDF term is to down-weight words that carry little semantic value across the corpus. If you leave the stop words in, the IDF should push their scores toward the low end, although they will still appear in the feature list.
The main difference between the two implementations is that TfidfVectorizer computes both the term frequency and the inverse document frequency for you, while TfidfTransformer requires you to first use the CountVectorizer class from scikit-learn to compute the term frequencies.
With TfidfTransformer you compute the word counts using CountVectorizer, then compute the inverse document frequency (IDF) values, and only then compute the TF-IDF scores. With TfidfVectorizer, by contrast, you do all three steps at once.
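To make the comparison concrete, here is a minimal sketch of both routes on a small made-up corpus (the documents are an assumption for illustration, not data from the question). With default settings the two routes should produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Illustrative corpus (an assumption, not data from the question).
docs = [
    "this is a green apple",
    "this is a machine learning book",
    "this apple is green",
]

# Two-step route: raw term counts first, then IDF weighting and normalisation.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer counts and weights in a single call.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two dense matrices element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is mainly useful when you also need the raw counts for something else; otherwise the single class is less code.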
TF-IDF increases a word's weight in proportion to the number of times it appears in a document, but counterbalances this by the number of documents in which the word is present. Hence, words like 'this' and 'are', which are present in nearly all the documents, are not given a very high rank.
This is how you can do it:
    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Extend the built-in English list; a plain list is accepted by all scikit-learn versions
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
    X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)

    # printing the vocabulary
    print(vectorizer.vocabulary_)
In this example, I created the TF-IDF vectors for two sample documents:

    "This is a green apple."
    "This is a machine learning book."
By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list. I also added book to the stop word list. This is the output:
    (0, 1)    0.707106781187
    (0, 0)    0.707106781187
    (1, 3)    0.707106781187
    (1, 2)    0.707106781187
    {'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}
As we can see, the word book is also removed from the list of features because we listed it as a stop word. So TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.