
Calculating IDF (Inverse Document Frequency) for document categorization

I have a question about calculating IDF (Inverse Document Frequency) for document categorization. I have several categories, each with multiple training documents. I am calculating the IDF of each term in a document using the following formula:

IDF(t, D) = log(Total number of documents in corpus / Number of documents matching term)

My questions are:

  1. What does "Total number of documents in corpus" mean? Is it the document count of the current category, or of all available categories?
  2. What does "Number of documents matching term" mean? Is it the count of documents containing the term in the current category, or in all available categories?

asked Aug 14 '12 07:08 by vignesh kumar rathakumar



1 Answer

"Total number of documents in corpus" is simply the number of documents in your entire corpus. So if you have 20 documents, this value is 20.

"Number of documents matching term" is the number of documents in which the term t occurs at least once. So if you have 20 documents in total and the term t occurs in 15 of them, this value is 15.

The value for this example would thus be IDF(t, D) = log(20/15) ≈ 0.1249 (using a base-10 logarithm).
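As a rough illustration of the calculation above, here is a minimal Python sketch. The function name `idf`, the toy documents and the choice of a base-10 logarithm (which is what gives 0.1249) are assumptions made for this example, not part of the question.

```python
import math

def idf(term, documents):
    """IDF of `term` over a list of tokenized documents (base-10 log)."""
    n_docs = len(documents)
    # Count documents that contain the term at least once.
    n_matching = sum(1 for doc in documents if term in doc)
    if n_matching == 0:
        raise ValueError("term does not occur in any document")
    return math.log10(n_docs / n_matching)

# Toy corpus: 20 documents, 15 of which contain "linux".
documents = [["linux", "kernel"]] * 15 + [["mysql", "index"]] * 5
value = idf("linux", documents)
print(round(value, 4))  # 0.1249
```

Note that both counts are taken over the same collection of documents: whatever you treat as the corpus for the numerator must also be used for the denominator.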

Now, if I understand correctly, your documents can have multiple categories and you want to be able to assign one or more of these categories to new documents. One way to do this is to create one document per category. Each category-document should hold all the texts labelled with that category. You can then compute tf*idf on these documents.

A simple way of categorizing a new document (the query) is then to sum, for each category, the tf*idf values that the query's terms have in that category-document. The category with the highest sum is ranked first.
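The category-document approach described above could be sketched roughly as follows. The function names, the toy training data, the relative-frequency definition of tf and the base-10 logarithm are all assumptions made for illustration; other tf and idf variants would work equally well.

```python
import math
from collections import Counter

def build_category_weights(labelled_texts):
    """labelled_texts: list of (category, tokens) pairs.
    Returns {category: {term: tf*idf}} with one merged document per category."""
    category_docs = {}
    for category, tokens in labelled_texts:
        category_docs.setdefault(category, []).extend(tokens)

    n_docs = len(category_docs)  # the corpus is now the set of category-documents
    df = Counter()
    for tokens in category_docs.values():
        df.update(set(tokens))   # each category-document counts once per term

    weights = {}
    for category, tokens in category_docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[category] = {
            term: (count / total) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

def rank_categories(query_tokens, weights):
    """Sum the tf*idf values of the query terms per category, best first."""
    scores = {
        category: sum(w.get(term, 0.0) for term in query_tokens)
        for category, w in weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical training data.
training = [
    ("linux", ["kernel", "module", "boot"]),
    ("linux", ["kernel", "shell"]),
    ("mysql", ["query", "index", "table"]),
    ("mysql", ["query", "join", "shell"]),
]
weights = build_category_weights(training)
ranking = rank_categories(["kernel", "boot"], weights)
print(ranking)  # ['linux', 'mysql']
```

One consequence of this setup is that a term occurring in every category-document (here "shell") gets an idf of 0 and so contributes nothing to any score.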

Another possibility is to create a vector for the query using the idf of each term in the query. All terms that don't occur in the query are given the value 0. The query vector can then be compared with each category vector using, for example, cosine similarity.
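A minimal sketch of the cosine-similarity comparison, assuming sparse vectors stored as Python dicts (terms missing from a dict implicitly have the value 0). The weights below are made-up numbers for illustration only.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical query vector (idf weights) and category vectors (tf*idf weights).
query_vector = {"kernel": 0.301, "boot": 0.301}
category_vectors = {
    "linux": {"kernel": 0.12, "module": 0.06, "boot": 0.06},
    "mysql": {"query": 0.10, "index": 0.05},
}

best = max(category_vectors,
           key=lambda c: cosine_similarity(query_vector, category_vectors[c]))
print(best)  # linux
```

Because cosine similarity normalizes by vector length, a category with much more training text is not favoured merely for being larger.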

Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
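As one possible illustration of smoothing, the sketch below adds 1 to both counts so that a term seen in no document still gets a finite idf instead of causing a division by zero. This add-one variant and the function name `smoothed_idf` are assumptions chosen for the example; several other smoothing schemes exist.

```python
import math

def smoothed_idf(n_docs, n_matching):
    """Add-one smoothed IDF (base-10 log): log((1 + N) / (1 + df))."""
    return math.log10((1 + n_docs) / (1 + n_matching))

unseen = smoothed_idf(20, 0)    # term that occurs in no document
common = smoothed_idf(20, 15)   # term that occurs in 15 of 20 documents
print(unseen > common)  # True
```

With this variant an unseen term receives the largest idf of all, which matches the intuition that it is rarer than anything in the corpus.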

I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

answered Sep 28 '22 01:09 by Sicco