adding words to stop_words list in TfidfVectorizer in sklearn

I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both the 'english' stop words and the stop words I specified. But TfidfVectorizer still does not accept my list of stop words, and I can still see those words in my features list. Below is my code:

    from sklearn.feature_extraction import text

    my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

    vectorizer = TfidfVectorizer(analyzer=u'word', max_df=0.95, lowercase=True, stop_words=set(my_stop_words), max_features=15000)
    X = vectorizer.fit_transform(text)

I have also tried setting stop_words=my_stop_words in TfidfVectorizer, but it still does not work. Please help.

asked Nov 09 '14 07:11

ac11

1 Answer

This is how you can do it:

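    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Extend the built-in English list. list() is used because recent
    # scikit-learn releases validate stop_words and expect a list, not a frozenset.
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)

    X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

    # mapping each feature to its IDF weight
    # (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)

    # printing the vocabulary
    print(vectorizer.vocabulary_)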

In this example, I created the tfidf vectors for two sample documents:

    "This is a green apple."
    "This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:

    (0, 1)  0.707106781187
    (0, 0)  0.707106781187
    (1, 3)  0.707106781187
    (1, 2)  0.707106781187
    {'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}
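The claim about which words are built in can be checked directly against the set. This is a minimal sketch; the names `builtin_hits` and `extended` are only illustrative and are not part of the answer's code.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# ENGLISH_STOP_WORDS is a frozenset of lowercase words
words_to_check = ["this", "is", "a", "an", "book"]
builtin_hits = {w: (w in ENGLISH_STOP_WORDS) for w in words_to_check}

# union() returns a new frozenset; the built-in set itself is not modified
extended = ENGLISH_STOP_WORDS.union(["book"])

print(builtin_hits)
print("book" in extended)
```

Since `union()` does not change `ENGLISH_STOP_WORDS` in place, the extended set has to be the object that is actually passed to the vectorizer.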

As we can see, the word book is also removed from the list of features because we listed it as a stop word. So TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.
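One possible reason a custom word still shows up, which the question does not confirm, is a case mismatch: with the default `lowercase=True`, documents are lowercased before stop words are filtered, so an entry like "Book" never matches the token "book". The sketch below assumes that situation; `my_words`, `raw`, and `fixed` are hypothetical names. In this case scikit-learn may also warn that the stop words look inconsistent with the preprocessing.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

docs = ["This is a green apple.", "This is a machine learning book."]
my_words = ["Book"]  # hypothetical custom stop word, capitalized by mistake

# Stop words are compared against lowercased tokens, so "Book" has no effect
raw = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS.union(my_words)))
raw.fit(docs)
print("book" in raw.vocabulary_)  # True: the word is still a feature

# Lowercasing the custom words first makes them match the tokens
normalized = [w.lower() for w in my_words]
fixed = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS.union(normalized)))
fixed.fit(docs)
print("book" in fixed.vocabulary_)  # False: the word is filtered out
```

Calling `fixed.get_stop_words()` returns the effective stop word set, which is a quick way to confirm what the vectorizer is really using.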

answered Sep 18 '22 18:09

Pedram
