I think TfidfVectorizer is not calculating the IDF factor correctly. For example, copying the code from "tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer":
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer(
    use_idf=True,             # use idf as a weight, i.e. compute tf*idf
    norm=None,                # None => do not normalize the vectors
    smooth_idf=False,         # if True, adds 1 to N and to ni => idf = ln((N+1)/(ni+1))
    sublinear_tf=False,       # if True, tf = 1 + ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1, 1), preprocessor=None, stop_words=None, tokenizer=None, vocabulary=None
)

X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))
The output is:
{u'is': 1.0,
u'nice': 1.6931471805599454,
u'strange': 1.6931471805599454,
u'this': 1.0,
u'very': 1.0}
But it should be:
{u'is': 0.0,
u'nice': 0.6931471805599454,
u'strange': 0.6931471805599454,
u'this': 0.0,
u'very': 0.0}
Shouldn't it? What am I doing wrong?
After all, according to http://www.tfidf.com/, the IDF is calculated as:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
Thus, since the terms 'this', 'is' and 'very' appear in both sentences, IDF = log_e(2/2) = 0.
The terms 'strange' and 'nice' appear in only one of the two documents, so IDF = log_e(2/1) = 0.69314.
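For reference, a quick sanity check of those expected values with plain Python gives exactly these numbers:

import math
print(math.log(2.0 / 2))   # 'this', 'is', 'very' appear in both documents -> 0.0
print(math.log(2.0 / 1))   # 'strange', 'nice' appear in only one document -> 0.6931...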
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.
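Plugging the question's corpus into that documented smooth_idf=False formula (a quick check, not from the original post) reproduces the printed idf_ values; the trailing +1 is the only difference from the textbook formula:

import numpy as np
n = 2.0                           # total number of documents
df = np.array([2, 1, 1, 2, 2])    # document frequency of: is, nice, strange, this, very
print(np.log(n / df) + 1.0)       # -> [1.0  1.6931...  1.6931...  1.0  1.0]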
The standard formula for IDF starts with the total number of documents, N, and divides it by the number of documents containing the term.
The TF-IDF of a term is calculated by multiplying its TF and IDF scores. Translated into plain English, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF.
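And since the question sets norm=None and every term occurs at most once per document, each row of X is simply tf * idf = idf for the terms that are present (a quick check reusing the X computed above, not from the original post):

print(X.toarray()[0])             # first document: "This is very strange"
# tf is 1 for every term present, so each weight is just the idf:
# [1.0  0.0  1.6931...  1.0  1.0]   (is, nice, strange, this, very)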
Two things are happening that you might not expect in the sklearn implementation: TfidfTransformer has smooth_idf=True as a default param, and it always adds 1 to the computed idf so that terms occurring in every document are not suppressed entirely. So with the defaults it is using:

idf = ln((1 + n_samples) / (1 + df)) + 1

where n_samples is the total number of documents and df is the number of documents containing the term.
Here it is in the source:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992
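As a quick numeric check (not part of the original answer): it is the "+1" that turns the expected 0.0 / 0.693... into the 1.0 / 1.693... the question printed with smooth_idf=False; with the default smooth_idf=True the same corpus would instead give:

import numpy as np
n_samples = 2
df = np.array([2, 1, 1, 2, 2])    # document frequency of: is, nice, strange, this, very
print(np.log((1.0 + n_samples) / (1.0 + df)) + 1.0)
# -> [1.0  1.4054...  1.4054...  1.0  1.0]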
EDIT:
You could subclass the standard TfidfVectorizer class like this:
import scipy.sparse as sp
import numpy as np
from sklearn.feature_extraction.text import (TfidfVectorizer,
                                              _document_frequency)


class PriscillasTfidfVectorizer(TfidfVectorizer):

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)

        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            ####### + 1 is commented out ##########################
            idf = np.log(float(n_samples) / df)  # + 1.0
            #######################################################
            self._idf_diag = sp.spdiags(idf,
                                        diags=0, m=n_features, n=n_features)

        return self
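Alternatively, a simpler workaround (just a sketch, not from the original answer): if you only need the textbook idf values rather than a modified vectorizer, subtract the constant 1 from idf_, since with smooth_idf=False sklearn stores ln(n/df) + 1 there:

# assumes the `vectorizer` fitted with smooth_idf=False from the question above
plain_idf = vectorizer.idf_ - 1.0
print(dict(zip(vectorizer.get_feature_names(), plain_idf)))
# {'is': 0.0, 'nice': 0.6931..., 'strange': 0.6931..., 'this': 0.0, 'very': 0.0}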
The actual formula they use to compute the idf (when smooth_idf is True) is:

idf = ln((1 + n_samples) / (1 + df)) + 1
It's in the source, but the web documentation is a little ambiguous about it, I think.
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L966-L969