
Difference in values of tf-idf matrix using scikit-learn and hand calculation

I am playing with scikit-learn to find the tf-idf values.

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

I want to create a matrix like this:

Docs    blue       bright     sky        sun
D1      tf-idf     0.0000000  tf-idf     0.0000000
D2      0.0000000  tf-idf     0.0000000  tf-idf
D3      0.0000000  tf-idf     tf-idf     tf-idf

So, my code in Python is:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')  # filters "is", "in", "the", ...

transformer = TfidfVectorizer(stop_words=stop_words)

# Fit on the corpus and materialize the sparse tf-idf matrix
t1 = transformer.fit_transform(train_set).todense()
print(t1)

The result matrix I get is:

[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]

If I do the calculation by hand, the matrix should be:

Docs    blue       bright     sky        sun
D1      0.2385     0.0000000  0.0880     0.0000000
D2      0.0000000  0.0880     0.0000000  0.0880
D3      0.0000000  0.058      0.058      0.058

I am calculating, for example, blue as tf = 1/2 = 0.5 and idf as log(3/1) = 0.477121255, so tf-idf = tf*idf = 0.5*0.477 = 0.2385. I calculate the other tf-idf values the same way. Now I am wondering: why am I getting different results from the hand calculation and from the Python code? Which one is correct? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?
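
For reference, the hand calculation described above can be spelled out in a few lines of Python (a minimal sketch; the 2 is D1's token count after stopword removal):

import math

# tf: term count / number of tokens in the document ("sky blue" after
# removing "is"); idf: log10(total docs / docs containing the term)
tf_blue = 1 / 2
idf_blue = math.log10(3 / 1)
print(tf_blue * idf_blue)   # -> 0.23856062735983122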

asked Jun 04 '14 by user2481422

1 Answer

There are two reasons:

  1. You are neglecting the smoothing that is commonly applied to idf
  2. You are assuming a base-10 logarithm

According to its source code, sklearn makes neither assumption.

First, it smooths the document counts (so there is never a zero):

df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)

Second, it uses the natural logarithm (np.log(np.e) == 1):

idf = np.log(float(n_samples) / df) + 1.0
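
Plugging the corpus into these two source lines reproduces sklearn's idf values (a minimal sketch; the document frequencies are blue=1, bright=2, sky=2, sun=2):

import numpy as np

df = np.array([1, 2, 2, 2])   # blue, bright, sky, sun
n_samples = 3

df = df + 1                   # df += int(self.smooth_idf)
n_samples = n_samples + 1     # n_samples += int(self.smooth_idf)

idf = np.log(n_samples / df) + 1.0
print(idf)   # -> [1.69314718 1.28768207 1.28768207 1.28768207]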

There is also l2 normalization applied by default. In short, scikit-learn does a few extra "nice little things" while computing tf-idf. Neither approach (theirs or yours) is wrong; theirs is simply more advanced.
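
Putting it together for D1 ("sky is blue" -> counts blue=1, sky=1): the raw tf-idf row is just the idf values, and l2-normalizing it reproduces sklearn's first row (a minimal sketch, reusing the idf values computed above):

import numpy as np

row = np.array([1.69314718, 0.0, 1.28768207, 0.0])   # blue, bright, sky, sun
row = row / np.linalg.norm(row)                      # default norm='l2'
print(row)   # -> [0.79596054 0.         0.60534851 0.        ]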

answered Oct 15 '22 by lejlot