
Difference in values of tf-idf matrix using scikit-learn and hand calculation

I am playing with scikit-learn to find the tf-idf values.

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

I want to create a matrix like this:

Docs    blue       bright     sky        sun
D1      tf-idf     0.0000000  tf-idf     0.0000000
D2      0.0000000  tf-idf     0.0000000  tf-idf
D3      0.0000000  tf-idf     tf-idf     tf-idf

So, my code in Python is:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')  # filters "is", "in", "the", ...

transformer = TfidfVectorizer(stop_words=stop_words)

# Fit on the corpus and materialize the sparse tf-idf matrix
t1 = transformer.fit_transform(train_set).todense()
print(t1)

The result matrix I get is:

[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]

If I do the calculation by hand, the matrix should be:

Docs    blue       bright     sky        sun
D1      0.2385     0.0000000  0.0880     0.0000000
D2      0.0000000  0.0880     0.0000000  0.0880
D3      0.0000000  0.058      0.058      0.058

I am calculating, for example, blue as tf = 1/2 = 0.5 and idf as log(3/1) = 0.477121255, so tf-idf = tf*idf = 0.5*0.477 = 0.2385. I calculate the other tf-idf values the same way. Now I am wondering: why am I getting different results from the hand calculation and from the Python code? Which one is correct? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?
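
For reference, the hand calculation described above can be spelled out in a few lines of Python (a minimal sketch; the 2 is D1's token count after stopword removal):

import math

# tf: term count / number of tokens in the document ("sky blue" after
# removing "is"); idf: log10(total docs / docs containing the term)
tf_blue = 1 / 2
idf_blue = math.log10(3 / 1)
print(tf_blue * idf_blue)   # -> 0.23856062735983122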

asked Jun 04 '14 by user2481422

1 Answer

There are two reasons:

  1. You are neglecting the smoothing that is commonly applied to idf
  2. You are assuming a base-10 logarithm

According to its source code, sklearn makes neither assumption.

First, it smooths the document counts (so there is never a zero):

df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)

Second, it uses the natural logarithm (np.log(np.e) == 1):

idf = np.log(float(n_samples) / df) + 1.0
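
Plugging the corpus into these two source lines reproduces sklearn's idf values (a minimal sketch; the document frequencies are blue=1, bright=2, sky=2, sun=2):

import numpy as np

df = np.array([1, 2, 2, 2])   # blue, bright, sky, sun
n_samples = 3

df = df + 1                   # df += int(self.smooth_idf)
n_samples = n_samples + 1     # n_samples += int(self.smooth_idf)

idf = np.log(n_samples / df) + 1.0
print(idf)   # -> [1.69314718 1.28768207 1.28768207 1.28768207]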

There is also l2 normalization applied by default. In short, scikit-learn does a few extra "nice little things" while computing tf-idf. Neither approach (theirs or yours) is wrong; theirs is simply more advanced.
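
Putting it together for D1 ("sky is blue" -> counts blue=1, sky=1): the raw tf-idf row is just the idf values, and l2-normalizing it reproduces sklearn's first row (a minimal sketch, reusing the idf values computed above):

import numpy as np

row = np.array([1.69314718, 0.0, 1.28768207, 0.0])   # blue, bright, sky, sun
row = row / np.linalg.norm(row)                      # default norm='l2'
print(row)   # -> [0.79596054 0.         0.60534851 0.        ]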

answered Oct 15 '22 by lejlot