How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weights for each document vector. I can then use this matrix to train a model for document classification.

I would like to classify new documents in the future. But to classify a new document, I first need to turn it into a document-term vector, and that vector should be composed of TF*IDF values, too.

My question is: how can I calculate TF*IDF with just a single document?

As far as I understand, TF can be calculated from a single document by itself, but IDF can only be calculated from a collection of documents. In my current experiment, I actually calculate the TF*IDF values over the whole collection of documents, and then use some documents as the training set and the others as the test set.

I just realized that this does not seem applicable to real life.

ADD 1

So there are actually two subtly different scenarios for classification:

1. to classify documents whose content is known but whose labels are not.
2. to classify totally unseen documents.

For 1, we can combine all the documents, both with and without labels, and compute TF*IDF over all of them. This way, even if we only use the labeled documents for training, the training result will still reflect the influence of the unlabeled documents.

But my scenario is 2.

Suppose I have the following information for term t from the summary of the training set corpus:

the document count for t in the training set is n
the total number of training documents is N

Should I calculate the IDF of t for an unseen document D as below?

IDF(t, D) = log((N+1)/(n+1))

ADD 2

And what if I encounter a term in the new document that did not show up in the training corpus? How should I calculate its weight in the doc-term vector?

476

asked Apr 01 '14 15:04

smwikipedia

3 Answers

TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.

You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.

Whether you want to use summary information from the whole training set and test set for TF-IDF, or from just the training set, is a matter of your problem formulation. If you only care to apply your classification system to documents whose contents you have but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
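
As a rough illustration of the point above (keep the per-term document counts and the total document count from training, then reuse them for any new document), here is a minimal Python sketch. The function names `fit_idf` and `tfidf_vector`, the toy corpus, the plain `log(N / df)` weighting, and the choice to drop terms never seen in training are all assumptions made for the example, not something this answer prescribes.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Collect corpus summary statistics from tokenized training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return n_docs, doc_freq


def tfidf_vector(tokens, n_docs, doc_freq):
    """Weight a new document using only the stored training statistics."""
    weights = {}
    for term, count in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # assumption: terms unseen in training are dropped
        weights[term] = count * math.log(n_docs / df)
    return weights


# Hypothetical toy training corpus (already tokenized).
train = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
    ["bird", "sang"],
]
n_docs, doc_freq = fit_idf(train)

# The new document never touches the stored statistics.
new_doc = ["cat", "cat", "sang", "zebra"]
vec = tfidf_vector(new_doc, n_docs, doc_freq)
print(vec)
```

The new document only contributes its own term counts; `n_docs` and `doc_freq` stay fixed after training, so every later document is weighted on the same scale as the training vectors.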

120

answered Oct 18 '22 09:10

Rob Neuhaus

TF obviously only depends on the new document.

IDF, you compute only on your training corpus.

You can add a slack term to the IDF computation, or adjust it as you suggested. But for a reasonable training set, the constant +1 term will not have a whole lot of effect. AFAICT, in classic document retrieval (think: search), you don't bother to do this. Often, the query document will not become part of your corpus, so why would it be part of IDF?
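
To get a feel for how much the +1 adjustment changes things, here is a small sketch comparing `log(N/n)` with the `log((N+1)/(n+1))` variant from the question. The helper names and the sample values of N and n are made up for illustration.

```python
import math


def idf_plain(n_docs, df):
    """IDF from the training corpus only: log(N / n)."""
    return math.log(n_docs / df)


def idf_plus_one(n_docs, df):
    """IDF as proposed in the question: log((N + 1) / (n + 1))."""
    return math.log((n_docs + 1) / (df + 1))


# Hypothetical (N, n) pairs: total training documents, documents containing the term.
for n_docs, df in [(10_000, 1_000), (10_000, 100), (10_000, 1)]:
    plain = idf_plain(n_docs, df)
    adjusted = idf_plus_one(n_docs, df)
    print(n_docs, df, round(plain, 4), round(adjusted, 4), round(plain - adjusted, 4))
```

Under these assumed numbers, the gap is tiny for terms that appear in many training documents. It grows for terms with a very low document count, approaching log 2 when n = 1, so the adjustment matters mostly for the rarest terms.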

answered Oct 18 '22 08:10

Has QUIT--Anony-Mousse

For unseen words, TF calculation is not a problem, as TF is a document-specific metric. When computing IDF, you can use a smoothed inverse document frequency.

IDF = 1 + log(total documents / document frequency of a term)

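Here the lower bound for IDF is 1, which is reached when a term appears in every document. A word not seen in the training corpus has a document frequency of 0, so the ratio is undefined; a common convention is to assign it this lower bound, so its IDF is 1. Since there is no universally agreed single formula for computing tf-idf or even idf, your formula for tf-idf calculation is also reasonable.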

Note that, in many cases, unseen terms are simply ignored if they don't have much impact on the classification task. Sometimes, people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation.
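
A minimal sketch of the two ideas above, assuming tokenized input and stored training counts: the `1 + log(N / df)` formula with unseen terms falling back to 1, and replacing unseen tokens with a placeholder. The helper names `idf_with_floor` and `replace_unseen` and the toy counts are hypothetical.

```python
import math
from collections import Counter

UNKNOWN_TOKEN = "UNKNOWN_TOKEN"


def idf_with_floor(term, n_docs, doc_freq):
    """IDF = 1 + log(N / df); terms unseen in training fall back to 1."""
    df = doc_freq.get(term, 0)
    if df == 0:
        return 1.0  # convention assumed here: unseen terms get the lower bound
    return 1.0 + math.log(n_docs / df)


def replace_unseen(tokens, vocabulary):
    """Map every token outside the training vocabulary to one placeholder."""
    return [t if t in vocabulary else UNKNOWN_TOKEN for t in tokens]


# Hypothetical training summary: 4 documents, per-term document counts.
n_docs = 4
doc_freq = {"cat": 2, "dog": 2, "sat": 4}

tokens = ["cat", "zebra", "sat", "quokka"]
mapped = replace_unseen(tokens, doc_freq)
weights = {
    term: count * idf_with_floor(term, n_docs, doc_freq)
    for term, count in Counter(mapped).items()
}
print(mapped)
print(weights)
```

Collapsing all unseen tokens into one placeholder keeps the vector dimension fixed at the training vocabulary plus one slot. Whether that slot helps the classifier depends on the task.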

Alternative to TF-IDF: another way of computing the weight of each term in a document is Maximum Likelihood Estimation. When computing the MLE, you can smooth it using additive smoothing, which is also known as Laplace smoothing. MLE is used when you are using generative models such as Naive Bayes for document classification.
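
For illustration, additive (Laplace) smoothing of per-class term probabilities, as used in a multinomial Naive Bayes setting, could look roughly like this. The function name `smoothed_term_probs`, the `alpha` parameter name, and the toy data are assumptions for the sketch.

```python
from collections import Counter


def smoothed_term_probs(class_docs, vocabulary, alpha=1.0):
    """Estimate P(term | class) with additive smoothing.

    P(t | c) = (count(t, c) + alpha) / (total_count(c) + alpha * |V|)
    """
    counts = Counter(
        t for tokens in class_docs for t in tokens if t in vocabulary
    )
    total = sum(counts.values())
    denom = total + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / denom for t in vocabulary}


# Hypothetical vocabulary and tokenized documents belonging to one class.
vocab = {"cat", "dog", "bird"}
class_docs = [["cat", "cat", "dog"], ["cat"]]

probs = smoothed_term_probs(class_docs, vocab)
print(probs)
```

With `alpha = 1`, a vocabulary term that never occurs in the class ("bird" here) still gets a small non-zero probability, which avoids a zero factor when the per-term probabilities are multiplied.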

answered Oct 18 '22 07:10

Wasi Ahmad

Tags:

machine-learning

classification

text-mining

information-retrieval

document-classification