I am playing with scikit-learn to compute tf-idf values. I have a set of documents like:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
I want to create a matrix like this:
Docs    blue        bright      sky         sun
D1      tf-idf      0.0000000   tf-idf      0.0000000
D2      0.0000000   tf-idf      0.0000000   tf-idf
D3      0.0000000   tf-idf      tf-idf      tf-idf
So, my code in Python is:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')

# Fit the vectorizer on the corpus and materialize the sparse result as a dense matrix
transformer = TfidfVectorizer(stop_words=stop_words)
t1 = transformer.fit_transform(train_set).todense()
print(t1)
The result matrix I get is:
[[ 0.79596054 0. 0.60534851 0. ]
[ 0. 0.4472136 0. 0.89442719]
[ 0. 0.57735027 0.57735027 0.57735027]]
If I do a hand calculation, then the matrix should be:
Docs    blue        bright      sky         sun
D1      0.2385      0.0000000   0.0880      0.0000000
D2      0.0000000   0.0880      0.0000000   0.0880
D3      0.0000000   0.058       0.058       0.058
For example, for blue in D1 I calculate tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255, so tf-idf = tf * idf = 0.5 * 0.477 = 0.2385.
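For reference, here is that textbook calculation as a minimal Python sketch (assuming a base-10 logarithm for the idf, as the numbers above imply):

import math

# Textbook tf-idf for "blue" in D1: "sky is blue" has 2 terms after stopword removal
tf = 1 / 2                # count of "blue" / number of terms in D1
idf = math.log10(3 / 1)   # log10(N / df): 3 documents, "blue" occurs in 1
print(tf * idf)           # 0.23856...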
In this way I calculate the other tf-idf values. Now I am wondering: why do I get different results from the hand calculation and from Python? Which one is correct? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?
There are two reasons: according to the source code, sklearn does not make the same assumptions as your textbook formula.
First, it smooths the document frequency count (so there is never a 0):
df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)
Second, it uses the natural logarithm (np.log(np.e) == 1):
idf = np.log(float(n_samples) / df) + 1.0
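Plugging your corpus into those two source lines, the idf values sklearn actually uses can be reproduced with a minimal sketch (the df counts below are read off the three documents, in the vectorizer's alphabetical term order blue, bright, sky, sun):

import numpy as np

# Document frequencies across the 3 documents, with smoothing (+1 on both counts)
df = np.array([1, 2, 2, 2])
idf = np.log((3 + 1) / (df + 1)) + 1.0
print(idf)  # [1.69314718 1.28768207 1.28768207 1.28768207]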
There is also a default l2 normalization applied. In short, scikit-learn does many more "nice little things" while computing tf-idf. Neither approach (theirs or yours) is bad; theirs is simply more advanced.
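Putting smoothing, the natural log, and the l2 norm together, a minimal sketch reproduces the first row of your printed output by hand (again in term order blue, bright, sky, sun):

import numpy as np

# D1 = "sky is blue" -> raw term counts after stopword removal
tf = np.array([1.0, 0.0, 1.0, 0.0])

# Smoothed idf with natural log, as in the quoted source lines
df = np.array([1, 2, 2, 2])
idf = np.log((3 + 1) / (df + 1)) + 1.0

# Raw tf-idf, then l2-normalize the row
row = tf * idf
row /= np.linalg.norm(row)
print(row)  # [0.79596054 0.         0.60534851 0.        ]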