tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

This page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

I then followed the example code and used fit_transform() on my corpus. How do I get the weight of each feature computed by fit_transform()?

I tried:

In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_

AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

but this attribute is missing.

Thanks

asked May 21 '14 by fast tooth

People also ask

What does Sklearn TfidfVectorizer do?

Scikit-learn's TfidfVectorizer is used to transform a corpus of text into a matrix of tf-idf features. It also provides the capability to preprocess your text data prior to generating the vector representation, making it a highly flexible feature representation module for text.
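For example, a minimal sketch (the corpus and the preprocessing options here are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The quick brown fox", "The quick lazy dog"]

# lowercase and stop_words are two of the built-in preprocessing options
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(corpus)  # sparse matrix of tf-idf weights

print(vectorizer.get_feature_names())  # ['brown', 'dog', 'fox', 'lazy', 'quick']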

What is Sklearn feature_extraction?

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

How does Sklearn calculate TF-IDF?

The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t, i.e. the number of documents in the document set that contain the term t.
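As a quick check of that formula (using a made-up three-document corpus), the idf of a term can be recomputed by hand and compared with what TfidfVectorizer stores:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]

# smooth_idf=False matches idf(t) = log[n / df(t)] + 1;
# norm=None disables the l2 row normalization applied by default
vectorizer = TfidfVectorizer(smooth_idf=False, norm=None)
X = vectorizer.fit_transform(corpus)

# "cat" appears in 2 of the 3 documents, so n = 3 and df(cat) = 2
idf_cat = np.log(3.0 / 2.0) + 1
print(idf_cat)                                         # 1.4054651081081644
print(vectorizer.idf_[vectorizer.vocabulary_['cat']])  # same value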

Which is better, CountVectorizer or TfidfVectorizer?

TF-IDF is generally better than plain count vectors because it not only captures the frequency of words in the corpus but also conveys the importance of each word. Words that are less important for analysis can then be removed, making model building less complex by reducing the input dimensions.
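To make the comparison concrete, here is a small sketch (corpus made up) that also confirms TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, as the docs quoted in the question state:

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ["This is very strange", "This is very nice"]

counts = CountVectorizer().fit_transform(corpus)           # raw term counts
tfidf_two_step = TfidfTransformer().fit_transform(counts)  # counts -> tf-idf
tfidf_one_step = TfidfVectorizer().fit_transform(corpus)   # one shot

# The two-step pipeline and TfidfVectorizer produce the same weights
print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True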


1 Answer

Since version 0.15, the idf weight learned for each feature can be retrieved via the attribute idf_ of the TfidfVectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))

Output:

{u'is': 1.0,
 u'nice': 1.4054651081081644,
 u'strange': 1.4054651081081644,
 u'this': 1.0,
 u'very': 1.0}

As discussed in the comments, prior to version 0.15 a workaround is to access idf_ through the private _tfidf attribute (an instance of TfidfTransformer) of the vectorizer:

idf = vectorizer._tfidf.idf_
print dict(zip(vectorizer.get_feature_names(), idf))

which should give the same output as above.
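Note that idf_ holds only the global idf weights; the per-document tf-idf weights that fit_transform() computes are the entries of the returned matrix X itself. Continuing from the snippet above, a minimal sketch for the first document:

# X has shape (n_documents, n_features); row i holds the tf-idf
# weights of document i (l2-normalized by default)
weights = X[0].toarray().ravel()
print dict(zip(vectorizer.get_feature_names(), weights))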

answered Sep 20 '22 by YS-L