How are TF-IDF scores calculated by the scikit-learn TfidfVectorizer?

I run the following code to convert a list of texts into a TF-IDF matrix.

text = ['This is a string','This is another string','TFIDF computation calculation','TfIDF is the product of TF and IDF']

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english',norm = None)

X = vectorizer.fit_transform(text)
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_

I get the following output

X_vovab =

[u'calculation',
 u'computation',
 u'idf',
 u'product',
 u'string',
 u'tf',
 u'tfidf']

and X_mat =

  ([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 1.91629073,  1.91629073,  0.        ,  0.        ,  0.        ,
      0.        ,  1.51082562],
    [ 0.        ,  0.        ,  1.91629073,  1.91629073,  0.        ,
      1.91629073,  1.51082562]])

Now I don't understand how these scores are computed. My idea is that for text[0], a score is computed only for 'string', and there is indeed a score in the 5th column. But as I understand it, TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), so the score should be 1.39 and not 1.51 as shown in the matrix. How is the TF-IDF score calculated in scikit-learn?

scikit-learn's TfidfVectorizer computes TF-IDF in multiple steps. It in fact uses TfidfTransformer and inherits from CountVectorizer.

Let me summarize the steps it does to make it more straightforward:

1. tfs are calculated by CountVectorizer's fit_transform()
2. idfs are calculated by TfidfTransformer's fit()
3. tfidfs are calculated by TfidfTransformer's transform()
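
As a rough illustration of those three steps, here is a minimal sketch that runs them by hand and compares the result with the one-step TfidfVectorizer. The variable names (counter, transformer, two_step, one_step, same) are mine, and I am assuming the same settings as in the question (English stop words, no normalization):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

text = [
    'This is a string',
    'This is another string',
    'TFIDF computation calculation',
    'TfIDF is the product of TF and IDF',
]

# Step 1: raw term counts (the tf values)
counter = CountVectorizer(stop_words='english')
counts = counter.fit_transform(text)

# Step 2: learn the idf weights from the counts
transformer = TfidfTransformer(norm=None)
transformer.fit(counts)

# Step 3: multiply tf by idf
two_step = transformer.transform(counts)

# The same computation done in one go, for comparison
one_step = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

If the two matrices agree, as they should, this prints True.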

You can check the source code here.

Back to your example. Here is the calculation that is done for the tfidf weight for the 5th term of the vocabulary, 1st document (X_mat[0,4]):

First, the tf for 'string', in the 1st document:

tf = 1

Second, the idf for 'string', with smoothing enabled (default behavior):

df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238

And finally, the tfidf weight for (document 0, feature 4):

tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
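
The arithmetic above can be checked with the standard library alone. This is only a sketch of the calculation for this one cell, not scikit-learn's actual code; the names n_docs, df, tf and question_estimate are mine. It also reproduces the 1.39 from the question, which comes from using 2 (the document frequency) as the term frequency and leaving out the smoothing and the added 1:

```python
import math

n_docs = 4  # N: number of documents
df = 2      # number of documents that contain 'string'
tf = 1      # occurrences of 'string' in text[0]

# Smoothed idf: add one to numerator and denominator, then add 1 to the log
idf = math.log((n_docs + 1) / (df + 1)) + 1
tfidf = tf * idf

# The estimate from the question: 2 * log(4 / 2)
question_estimate = 2 * math.log(n_docs / df)

print(round(idf, 8))                # 1.51082562
print(round(tfidf, 8))              # 1.51082562
print(round(question_estimate, 2))  # 1.39
```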

I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since many models work better when the feature matrix (or design matrix) is normalized.

TfidfVectorizer will L2-normalize each row of the output matrix by default (norm='l2'), as a final step of the calculation. With that normalization every document vector has unit Euclidean length, so all weights lie between 0 and 1.
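
To make the effect of that last step concrete, here is a small sketch that L2-normalizes the last row of the matrix from the question by hand. It uses only the standard library and the rounded values printed above, so it approximates what norm='l2' would produce for that document; the names row, normalized and unit_length are mine:

```python
import math

# Unnormalized tf-idf row for text[3], copied from X_mat in the question
row = [0.0, 0.0, 1.91629073, 1.91629073, 0.0, 1.91629073, 1.51082562]

# Euclidean (L2) length of the row
length = math.sqrt(sum(w * w for w in row))

# Divide every weight by the length
normalized = [w / length for w in row]

# After normalization the row has length 1 and every weight is in [0, 1]
unit_length = math.sqrt(sum(w * w for w in normalized))
print([round(w, 4) for w in normalized])
print(round(unit_length, 6))
```

Terms keep their relative order within a document, but weights are no longer directly comparable with the raw tf * idf values.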

The precise computation formula is given in the docs:

The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf

and

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.

That means 1.51082562 is obtained as 1.51082562 = 1 + ln((4 + 1) / (2 + 1)).
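
The same smoothed formula accounts for every value in the matrix from the question. As a sketch, assuming the document frequencies below (which I counted by hand from the four texts after stop-word removal; the names doc_freq and idf are mine), the idf of each vocabulary term works out as:

```python
import math

n_docs = 4

# Number of documents containing each vocabulary term (counted by hand)
doc_freq = {
    'calculation': 1,
    'computation': 1,
    'idf': 1,
    'product': 1,
    'string': 2,
    'tf': 1,
    'tfidf': 2,
}

# Smoothed idf: 1 + ln((1 + n) / (1 + df))
idf = {
    term: 1 + math.log((1 + n_docs) / (1 + df))
    for term, df in doc_freq.items()
}

for term, value in idf.items():
    print(f'{term:12s} {value:.8f}')
```

Terms that appear in one document get 1.91629073 and terms that appear in two get 1.51082562, which are the only two non-zero values in X_mat, since every term occurs at most once per document.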

Tags: nlp, scikit-learn, tf-idf

prashanth

Rabbit

Christian Hirsch
