
How does TfidfVectorizer compute scores on test data

In scikit-learn, TfidfVectorizer lets us fit on training data and later use the same fitted vectorizer to transform our test data. The output of transforming the training data is a matrix in which each entry is the tf-idf score of a word in a given document.

However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either:

  1. The score of a word in a new document is computed by some aggregation of that word's scores over the documents in the training set.
  2. The new document is 'added' to the existing corpus and new scores are calculated.

I have tried deducing the operation from scikit-learn's source code but could not quite figure it out. Is it one of the options I've previously mentioned or something else entirely? Please assist.

asked Apr 16 '19 by Yuval Cohen


People also ask

How do I get my TF-IDF score?

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.

How is TF-IDF Sklearn calculated?

The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t; the resulting tf-idf vectors are then normalized by the Euclidean norm.
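To make the formula concrete, here is a minimal sketch that recomputes both idf variants with NumPy; the toy corpus, its document-frequency counts, and the variable names are purely illustrative:

import numpy as np

# toy corpus: n = 2 documents (illustrative only)
n = 2

# document frequency of each term in that corpus
df = {"we": 2, "love": 2, "apples": 1, "bananas": 1, "really": 1}

# smooth_idf=True (the default): idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_smooth = {t: np.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

# smooth_idf=False: idf(t) = ln(n / df(t)) + 1
idf_raw = {t: np.log(n / d) + 1 for t, d in df.items()}

print(idf_smooth["really"])  # ln(3/2) + 1 ~ 1.405
print(idf_raw["really"])     # ln(2/1) + 1 ~ 1.693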

What does TfidfVectorizer return?

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator. Its vocabulary_ attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own column index.
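A minimal sketch of that attribute, using a made-up two-document corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["we love apples", "we really love bananas"]  # illustrative corpus

vect = TfidfVectorizer()
X = vect.fit_transform(docs)  # sparse matrix of shape (2, 5)

# maps each token to its column index in X (indices follow alphabetical order)
print(vect.vocabulary_)
# {'we': 4, 'love': 2, 'apples': 0, 'really': 3, 'bananas': 1}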

What does TF-IDF transform do?

TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It's a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks.

When to use tfidftransformer vs tfidfvectorizer?

Here is a general guideline:

  1. If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
  2. If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
  3. If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
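A small sketch of the equivalence mentioned in point 3, assuming default parameters and an illustrative two-document corpus: CountVectorizer followed by TfidfTransformer should produce the same matrix as TfidfVectorizer alone.

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = ["we love apples", "we really love bananas"]  # illustrative corpus

# Route 1: term counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
tfidf_a = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer does both steps in one object
tfidf_b = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_a.toarray(), tfidf_b.toarray()))  # True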

How do I get the tf-idf scores for my Docs?

Then, by invoking tfidf_transformer.transform(count_vector), you finally compute the tf-idf scores for your docs. Internally this performs the tf * idf multiplication, where each term frequency is weighted by its idf value. Now, let's print the tf-idf values of the first document to see if they make sense.
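A minimal sketch of that step; the corpus is made up, but the tfidf_transformer and count_vector names mirror the ones in the snippet above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["we love apples", "we really love bananas"]  # illustrative corpus

cv = CountVectorizer()
count_vector = cv.fit_transform(docs)

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_vector)
tfidf_vectors = tfidf_transformer.transform(count_vector)

# tf-idf scores of the first document, highest first
scores = tfidf_vectors[0].toarray().ravel()
for term, score in sorted(zip(cv.get_feature_names_out(), scores),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{term}: {score:.3f}")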

Is it possible to train TFIDF on the test corpus?

When training a model, the tf-idf weights can be fitted on the training corpus only or on the training and test corpora together. Including the test corpus when fitting seems questionable, yet because tf-idf is unsupervised it is technically possible to fit it on the whole corpus. Which is better?

How to use tfidftransformer to count words?

To start using TfidfTransformer you first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, and so on. A sketch of this step, including a check of the resulting shape, follows below.
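The documents and parameter values here are made up and purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]

# count term frequencies while dropping English stop words and
# capping the vocabulary size (both settings are illustrative)
cv = CountVectorizer(stop_words="english", max_features=10)
word_count_vector = cv.fit_transform(docs)

# one row per document, one column per retained vocabulary term
print(word_count_vector.shape)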


1 Answer

It is definitely the former: each word's idf (inverse document frequency) is calculated based on the training documents only. This makes sense because these values are precisely the ones computed when you call fit on your vectorizer. If the second option you describe were true, we would essentially refit the vectorizer each time, and we would also cause information leakage, as idf values from the test set would be used during model evaluation.

Beyond these purely conceptual explanations, you can also run the following code to convince yourself:

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train = ["We love apples", "We really love bananas"]
vect.fit(x_train)
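# note: in scikit-learn >= 1.0 this method is get_feature_names_out();
# get_feature_names() was removed in scikit-learn 1.2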
print(vect.get_feature_names())
>>> ['apples', 'bananas', 'love', 'really', 'we']

x_test = ["We really love pears"]

vectorized = vect.transform(x_test)
print(vectorized.toarray())
>>> array([[0.        , 0.        , 0.50154891, 0.70490949, 0.50154891]])

Following the reasoning of how the fit methodology works, you can recalculate these tfidf values yourself:

"apples" and "bananas" obviously have a tfidf score of 0 because they do not appear in x_test. "pears", on the other hand, does not exist in x_train and so will not even appear in the vectorization. Hence, only "love", "really" and "we" will have a tfidf score.

Scikit-learn implements tfidf as (log((1+n)/(1+df)) + 1) * f, where n is the number of documents in the training set (2 for us), df is the number of training documents in which the word appears, and f is the frequency count of the word in the test document. Hence:

import numpy as np

tfidf_love = (np.log((1+2)/(1+2))+1)*1
tfidf_really = (np.log((1+2)/(1+1))+1)*1
tfidf_we = (np.log((1+2)/(1+2))+1)*1

You then need to normalize these tfidf scores by the L2 norm of your document vector:

tfidf_non_scaled = np.array([tfidf_love,tfidf_really,tfidf_we])
tfidf_list = tfidf_non_scaled/sum(tfidf_non_scaled**2)**0.5

print(tfidf_list)
>>> [0.50154891 0.70490949 0.50154891]

You can see that indeed, we are getting the same values, which confirms the way scikit-learn implemented this methodology.
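As an additional check (continuing from the snippet above), the fitted idf weights can be read directly off the vectorizer; they depend only on x_train, which is why transform() on x_test simply reuses them:

print(vect.idf_)
>>> [1.40546511 1.40546511 1.         1.40546511 1.        ]

The columns are apples, bananas, love, really, we: the words appearing in both training documents get idf = ln((1+2)/(1+2)) + 1 = 1, while the rarer words get ln((1+2)/(1+1)) + 1 ≈ 1.405, matching the manual calculation above.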

answered Oct 13 '22 by MaximeKan