this page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:
As tf–idf is a very often used for text features, there is also another class called TfidfVectorizer that combines all the option of CountVectorizer and TfidfTransformer in a single model.
then I followed the code and use fit_transform() on my corpus. How to get the weight of each feature computed by fit_transform()?
I tried:
In [39]: vectorizer.idf_ --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-39-5475eefe04c0> in <module>() ----> 1 vectorizer.idf_ AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'
but this attribute is missing.
Thanks
Scikit-learn's CountVectorizer is used to transform a corpora of text to a vector of term / token counts. It also provides the capability to preprocess your text data prior to generating the vector representation making it a highly flexible feature representation module for text.
The sklearn. feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False ), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...
TF-IDF is better than Count Vectorizers because it not only focuses on the frequency of words present in the corpus but also provides the importance of the words. We can then remove the words that are less important for analysis, hence making the model building less complex by reducing the input dimensions.
Since version 0.15, the tf-idf score of each feature can be retrieved via the attribute idf_
of the TfidfVectorizer
object:
from sklearn.feature_extraction.text import TfidfVectorizer corpus = ["This is very strange", "This is very nice"] vectorizer = TfidfVectorizer(min_df=1) X = vectorizer.fit_transform(corpus) idf = vectorizer.idf_ print dict(zip(vectorizer.get_feature_names(), idf))
Output:
{u'is': 1.0, u'nice': 1.4054651081081644, u'strange': 1.4054651081081644, u'this': 1.0, u'very': 1.0}
As discussed in the comments, prior to version 0.15, a workaround is to access the attribute idf_
via the supposedly hidden _tfidf
(an instance of TfidfTransformer
) of the vectorizer:
idf = vectorizer._tfidf.idf_ print dict(zip(vectorizer.get_feature_names(), idf))
which should give the same output as above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With