I have a doubt about calculating IDF (Inverse Document Frequency) for document categorization. I have more than one category, each with multiple documents for training. I am calculating the IDF for each term in a document using the following formula:
IDF(t,D) = log(Total Number of Documents / Number of Documents matching term)
My questions are based on the following definitions I found:

The inverse document frequency is a measure of whether a term is common or rare in a given document corpus. It is obtained by dividing the total number of documents by the number of documents containing the term.

In tf*idf, the first factor is the term frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The second factor is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

The IDF of a term is the number of documents in the corpus divided by the document frequency of the term: idf(t) = N/df(t) = N/N(t)
Total Number of Documents in Corpus
is simply the number of documents you have in your corpus. So if you have 20 documents, then this value is 20.
Number of Documents matching term
is the count of how many documents the term t occurs in. So if you have 20 documents in total and the term t occurs in 15 of them, then the value for Number of Documents matching term is 15.
The value for this example would thus be IDF(t,D) = log(20/15) ≈ 0.1249 (using the base-10 logarithm).
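The worked example above can be checked in a few lines of Python (using the base-10 logarithm, which matches the 0.1249 result):

```python
import math

total_docs = 20          # documents in the corpus
docs_with_term = 15      # documents in which term t occurs

idf = math.log10(total_docs / docs_with_term)
print(round(idf, 4))     # 0.1249
```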
Now, if I understand correctly, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category: each category-document holds all texts which are labelled with that category. You can then perform tf*idf on these category-documents.
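A minimal sketch of this idea, with a made-up toy corpus (the category names and texts here are purely illustrative) and tf*idf computed over the category-documents using the log formula from the question:

```python
import math
from collections import Counter

# Hypothetical training data: each category maps to the concatenated
# texts of all documents labelled with that category.
category_docs = {
    "sports": "ball game team win game",
    "politics": "vote election win team",
    "tech": "code bug release code",
}

N = len(category_docs)  # number of category-documents

# Document frequency: in how many category-documents each term occurs
df = Counter()
for text in category_docs.values():
    df.update(set(text.split()))

# tf*idf per category: tf is the raw count within the category-document,
# idf is log10(N / df(t)) as in the formula above
tfidf = {}
for cat, text in category_docs.items():
    tf = Counter(text.split())
    tfidf[cat] = {t: tf[t] * math.log10(N / df[t]) for t in tf}

# "game" occurs in only one category-document, so its idf is log10(3/1)
print(tfidf["sports"]["game"])
```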
A simple way of categorizing a new document is then to score the query against each category: for each term in the query, look up the tf*idf value that term received in that category, and sum these values. The category with the highest total score is ranked first.
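The summing approach can be sketched like this; the per-category term weights below are hypothetical stand-ins for tf*idf values computed during training:

```python
# Hypothetical per-category term weights (tf*idf values from training)
category_weights = {
    "sports": {"game": 0.95, "ball": 0.48, "win": 0.18},
    "politics": {"vote": 0.48, "election": 0.48, "win": 0.18},
}

def rank_categories(query, weights):
    """Score each category by summing the weights of the query's terms;
    terms unknown to a category contribute 0."""
    scores = {
        cat: sum(w.get(term, 0.0) for term in query.split())
        for cat, w in weights.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_categories("win the game", category_weights))
```

With these toy weights, "win the game" matches "sports" more strongly than "politics", so "sports" is ranked first.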
Another possibility is to create a vector for the query using the idf of each term in the query; all terms which don't occur in the query are given the value 0. The query vector can then be compared for similarity to each category vector using, for example, cosine similarity.
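Cosine similarity over sparse term-weight vectors is straightforward to implement directly; the example vectors here are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term->weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical idf-weighted query vector and category vectors
query_vec = {"win": 0.18, "game": 0.95}
sports_vec = {"game": 0.95, "ball": 0.48, "win": 0.18}
politics_vec = {"vote": 0.48, "election": 0.48, "win": 0.18}

print(cosine(query_vec, sports_vec), cosine(query_vec, politics_vec))
```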
Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
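One common smoothing variant (add-one smoothing of the document frequency; other variants exist) avoids a division by zero for terms that never occur in the corpus:

```python
import math

def smoothed_idf(N, df):
    """Add-one smoothed idf: unseen terms (df = 0) get a finite value
    instead of causing a division by zero."""
    return math.log10((N + 1) / (df + 1))

print(smoothed_idf(20, 0))   # unseen term: finite, no division by zero
print(smoothed_idf(20, 15))  # close to the unsmoothed log10(20/15)
```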
I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.