In scikit-learn, TfidfVectorizer allows us to fit over training data and later use the same vectorizer to transform our test data.
The output of the transformation over the train data is a matrix that represents a tf-idf score for each word for a given document.
However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either:
I have tried deducing the operation from scikit-learn's source code but could not quite figure it out. Is it one of the options I've previously mentioned or something else entirely? Please assist.
As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False ), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...
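As a rough check of the smooth_idf=False formula quoted above, the sketch below counts document frequencies by hand on a toy two-document corpus and compares the result with the fitted idf_ attribute. The corpus and the variable names (docs, manual_idf) are assumptions made for illustration, and the hand-rolled tokenization (lowercase, split on whitespace) only matches the default tokenizer for simple text like this.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["We love apples", "We really love bananas"]  # toy corpus (assumption)

vect = TfidfVectorizer(smooth_idf=False)
vect.fit(docs)

# Vocabulary terms ordered by their column index.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# n = number of documents, df(t) = number of documents containing term t.
n = len(docs)
df = np.array([sum(t in d.lower().split() for d in docs) for t in terms])

# idf(t) = log(n / df(t)) + 1, the smooth_idf=False variant.
manual_idf = np.log(n / df) + 1

print(dict(zip(terms, manual_idf.round(4))))
print(np.allclose(manual_idf, vect.idf_))
```

Terms that occur in every document ("love", "we") get an idf of exactly 1 here, because log(2/2) = 0.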
TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator. Its vocabulary_ attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.
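To make the role of vocabulary_ concrete, here is a small sketch on an assumed toy corpus. It looks up the column index of one token and then reads that column out of the tf-idf matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["We love apples", "We really love bananas"]  # toy corpus (assumption)

vect = TfidfVectorizer()
matrix = vect.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index in the matrix.
print(vect.vocabulary_)

# Look up the column for "love" and read its score in every document.
col = vect.vocabulary_["love"]
love_scores = matrix[:, col].toarray().ravel()
print(col, love_scores)
```

The matrix has one column per vocabulary entry, so its shape here is (2, 5).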
TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It's a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks.
Here is a general guideline:
1. If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
2. If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
3. If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
Then, by invoking tfidf_transformer.transform(count_vector) you will finally be computing the tf-idf scores for your docs. Internally this is computing the tf * idf multiplication where your term frequency is weighted by its IDF values. Now, let’s print the tf-idf values of the first document to see if it makes sense.
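The two-step route described here (CountVectorizer, then TfidfTransformer) can be sketched as follows. The toy documents and the names train_docs, new_docs, two_step and one_step are assumptions for illustration; tfidf_transformer and count_vector follow the names used in the text. With default settings, the result should match what a single TfidfVectorizer produces.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

train_docs = ["We love apples", "We really love bananas"]  # toy data (assumption)
new_docs = ["We really love pears"]

# Step 1: learn the vocabulary and raw term counts from the training docs.
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_docs)

# Step 2: learn the idf weights from those training counts.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(train_counts)

# Step 3: count terms in the new docs, then weight the counts by the stored idf.
count_vector = count_vect.transform(new_docs)
two_step = tfidf_transformer.transform(count_vector)

# Print the tf-idf values of the first document.
print(two_step[0].toarray())

# The one-step TfidfVectorizer should give the same matrix.
one_step = TfidfVectorizer().fit(train_docs).transform(new_docs)
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```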
When training a model, it is possible to fit the tf-idf on the training corpus only, or on the test set as well. Including the test corpus does not seem to make sense, although, since tf-idf is unsupervised, it is also possible to fit it on the whole corpus. Which is better?
2. Initialize CountVectorizer. In order to start using TfidfTransformer, you will first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, etc.
It is definitely the former: each word's idf (inverse document frequency) is calculated based on the training documents only. This makes sense because these values are precisely the ones that are calculated when you call fit on your vectorizer. If the second option you describe were true, we would essentially refit the vectorizer each time, and we would also cause an information leak, as idf values from the test set would be used during model evaluation.
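One way to check this directly is to snapshot the fitted state, call transform on unseen text, and confirm that nothing changed. This is a minimal sketch on assumed toy data; idf_ and vocabulary_ are the fitted attributes of the vectorizer, and the other names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

x_train = ["We love apples", "We really love bananas"]  # toy data (assumption)
x_test = ["We really love pears", "Pears pears pears"]

vect = TfidfVectorizer().fit(x_train)

# Snapshot the state learned from the training documents.
idf_before = vect.idf_.copy()
vocab_before = dict(vect.vocabulary_)

# Transforming new documents only reads that state.
vect.transform(x_test)

idf_unchanged = np.array_equal(idf_before, vect.idf_)
vocab_unchanged = vocab_before == vect.vocabulary_
print(idf_unchanged, vocab_unchanged)

# "pears" was never seen during fit, so it has no column at all.
print("pears" in vect.vocabulary_)
```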
Beyond these purely conceptual explanations, you can also run the following code to convince yourself:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train = ["We love apples", "We really love bananas"]
vect.fit(x_train)
print(vect.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
# ['apples' 'bananas' 'love' 'really' 'we']
x_test = ["We really love pears"]
vectorized = vect.transform(x_test)
print(vectorized.toarray())
# [[0.         0.         0.50154891 0.70490949 0.50154891]]
Following the reasoning of how fit works, you can recalculate these tfidf values yourself:
"apples" and "bananas" obviously have a tfidf score of 0 because they do not appear in x_test. "pears", on the other hand, does not exist in x_train and so will not even appear in the vectorization. Hence, only "love", "really" and "we" will have a tfidf score.
With its default settings (smooth_idf=True), scikit-learn implements tfidf as (log((1+n)/(1+df)) + 1) * f, where n is the number of documents in the training set (2 for us), df is the number of training documents in which the word appears, and f is the number of times the word occurs in the test document. Hence:
import numpy as np
# n = 2 training documents; "love" and "we" appear in both, "really" in one
tfidf_love = (np.log((1+2)/(1+2))+1)*1
tfidf_really = (np.log((1+2)/(1+1))+1)*1
tfidf_we = (np.log((1+2)/(1+2))+1)*1
You then need to divide these tfidf scores by the L2 norm of the document's tfidf vector (the default norm='l2'):
tfidf_non_scaled = np.array([tfidf_love,tfidf_really,tfidf_we])
tfidf_list = tfidf_non_scaled/sum(tfidf_non_scaled**2)**0.5
print(tfidf_list)
# [0.50154891 0.70490949 0.50154891]
You can see that we are indeed getting the same values, which confirms how scikit-learn implements this computation.
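The same hand calculation can be wrapped in a small helper and checked against transform for any document. This is only a sketch: manual_tfidf is a hypothetical name, not part of scikit-learn, and it assumes the vectorizer was created with default settings (use_idf=True, smooth_idf=True, sublinear_tf=False, norm='l2').

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def manual_tfidf(vect, train_docs, doc):
    """Recompute one row of vect.transform([doc]) by hand (default settings only)."""
    analyze = vect.build_analyzer()  # same preprocessing and tokenization as vect
    terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

    # df comes from the training documents only.
    n = len(train_docs)
    train_tokens = [set(analyze(d)) for d in train_docs]
    df = np.array([sum(t in toks for toks in train_tokens) for t in terms])
    idf = np.log((1 + n) / (1 + df)) + 1

    # f is the raw count of each vocabulary term in the new document.
    doc_tokens = analyze(doc)
    f = np.array([doc_tokens.count(t) for t in terms], dtype=float)

    scores = f * idf
    norm = np.linalg.norm(scores)  # L2 norm of the document vector
    return scores / norm if norm > 0 else scores


x_train = ["We love apples", "We really love bananas"]
vect = TfidfVectorizer().fit(x_train)

doc = "We really love pears"
by_hand = manual_tfidf(vect, x_train, doc)
by_sklearn = vect.transform([doc]).toarray()[0]
print(np.allclose(by_hand, by_sklearn))
```

A document made up entirely of unseen words yields an all-zero vector, which is why the helper guards against dividing by a zero norm.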