I think TfidfVectorizer is not calculating the IDF factor correctly. For example, copying the code from "tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer":
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer(
    use_idf=True,             # use idf as a weight, i.e. compute tf*idf
    norm=None,                # None => do not normalize the vectors
    smooth_idf=False,         # if True, adds 1 to N and to ni => idf = ln((N+1)/(ni+1))
    sublinear_tf=False,       # if True, tf = 1 + ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1, 1), preprocessor=None, stop_words=None, tokenizer=None, vocabulary=None
)

X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))
The output is:
{u'is': 1.0,
u'nice': 1.6931471805599454,
u'strange': 1.6931471805599454,
u'this': 1.0,
u'very': 1.0}
But it should be:
{u'is': 0.0,
u'nice': 0.6931471805599454,
u'strange': 0.6931471805599454,
u'this': 0.0,
u'very': 0.0}
Shouldn't it? What am I doing wrong?
After all, according to http://www.tfidf.com/, the IDF is calculated as:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
Thus, since the terms 'this', 'is' and 'very' appear in both sentences, IDF = log_e(2/2) = 0.
The terms 'strange' and 'nice' appear in only one of the two documents, so IDF = log_e(2/1) = 0.69314.
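For reference, a quick sanity check of those expected values with plain Python gives exactly these numbers:

import math
print(math.log(2.0 / 2))   # 'this', 'is', 'very' appear in both documents -> 0.0
print(math.log(2.0 / 1))   # 'strange', 'nice' appear in only one document -> 0.6931...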
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.
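Plugging the question's corpus into that documented smooth_idf=False formula (a quick check, not from the original post) reproduces the printed idf_ values; the trailing +1 is the only difference from the textbook formula:

import numpy as np
n = 2.0                           # total number of documents
df = np.array([2, 1, 1, 2, 2])    # document frequency of: is, nice, strange, this, very
print(np.log(n / df) + 1.0)       # -> [1.0  1.6931...  1.6931...  1.0  1.0]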
The standard formula for IDF starts with the total number of documents, N, and divides it by the number of documents containing the term.
The TF-IDF of a term is calculated by multiplying its TF and IDF scores. Translated into plain English, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF.
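And since the question sets norm=None and every term occurs at most once per document, each row of X is simply tf * idf = idf for the terms that are present (a quick check reusing the X computed above, not from the original post):

print(X.toarray()[0])             # first document: "This is very strange"
# tf is 1 for every term present, so each weight is just the idf:
# [1.0  0.0  1.6931...  1.0  1.0]   (is, nice, strange, this, very)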
Two things are happening that you might not expect in the sklearn implementation: TfidfTransformer has smooth_idf=True as a default param, and it always adds 1 to the computed idf so that terms occurring in every document are not suppressed entirely. So with the defaults it is using:

idf = ln((1 + n_samples) / (1 + df)) + 1

where n_samples is the total number of documents and df is the number of documents containing the term.
Here it is in the source:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992
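As a quick numeric check (not part of the original answer): it is the "+1" that turns the expected 0.0 / 0.693... into the 1.0 / 1.693... the question printed with smooth_idf=False; with the default smooth_idf=True the same corpus would instead give:

import numpy as np
n_samples = 2
df = np.array([2, 1, 1, 2, 2])    # document frequency of: is, nice, strange, this, very
print(np.log((1.0 + n_samples) / (1.0 + df)) + 1.0)
# -> [1.0  1.4054...  1.4054...  1.0  1.0]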
EDIT:
You could subclass the standard TfidfVectorizer class like this:
import scipy.sparse as sp
import numpy as np
from sklearn.feature_extraction.text import (TfidfVectorizer,
                                              _document_frequency)


class PriscillasTfidfVectorizer(TfidfVectorizer):

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)

        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            ####### + 1 is commented out ##########################
            idf = np.log(float(n_samples) / df)  # + 1.0
            #######################################################
            self._idf_diag = sp.spdiags(idf,
                                        diags=0, m=n_features, n=n_features)

        return self
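Alternatively, a simpler workaround (just a sketch, not from the original answer): if you only need the textbook idf values rather than a modified vectorizer, subtract the constant 1 from idf_, since with smooth_idf=False sklearn stores ln(n/df) + 1 there:

# assumes the `vectorizer` fitted with smooth_idf=False from the question above
plain_idf = vectorizer.idf_ - 1.0
print(dict(zip(vectorizer.get_feature_names(), plain_idf)))
# {'is': 0.0, 'nice': 0.6931..., 'strange': 0.6931..., 'this': 0.0, 'very': 0.0}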
The actual formula they use to compute the idf (when smooth_idf is True) is:

idf = ln((1 + n_samples) / (1 + df)) + 1
It's in the source, but the web documentation is a little ambiguous about it, I think.
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L966-L969