I have a doubt about calculating IDF (Inverse Document Frequency) for document categorization. I have more than one category, each with multiple documents for training. I am calculating the IDF for each term in a document using the following formula:
IDF(t,D) = log(Total Number of Documents / Number of Documents matching term)
My questions are based on the following definitions I found:

The inverse document frequency is a measure of whether a term is common or rare in a given document corpus. It is obtained by dividing the total number of documents by the number of documents containing the term.

In tf*idf, the first factor is the term frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The second factor is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

The IDF of a term is the number of documents in the corpus divided by the document frequency of the term: idf(t) = N/df(t) = N/N(t)
Total Number of Documents in Corpus
is simply the number of documents you have in your corpus. So if you have 20 documents, then this value is 20.
Number of Documents matching term
is the count of how many documents the term t occurs in. So if you have 20 documents in total and the term t occurs in 15 of them, then the value for Number of Documents matching term is 15.
The value for this example would thus be IDF(t,D) = log(20/15) ≈ 0.1249 (using the base-10 logarithm).
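The worked example above can be checked in a few lines of Python (using the base-10 logarithm, which matches the 0.1249 result):

```python
import math

total_docs = 20          # documents in the corpus
docs_with_term = 15      # documents in which term t occurs

idf = math.log10(total_docs / docs_with_term)
print(round(idf, 4))     # 0.1249
```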
Now, if I understand correctly, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category: each category-document holds all texts which are labelled with that category. You can then perform tf*idf on these category-documents.
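A minimal sketch of this idea, with a made-up toy corpus (the category names and texts here are purely illustrative) and tf*idf computed over the category-documents using the log formula from the question:

```python
import math
from collections import Counter

# Hypothetical training data: each category maps to the concatenated
# texts of all documents labelled with that category.
category_docs = {
    "sports": "ball game team win game",
    "politics": "vote election win team",
    "tech": "code bug release code",
}

N = len(category_docs)  # number of category-documents

# Document frequency: in how many category-documents each term occurs
df = Counter()
for text in category_docs.values():
    df.update(set(text.split()))

# tf*idf per category: tf is the raw count within the category-document,
# idf is log10(N / df(t)) as in the formula above
tfidf = {}
for cat, text in category_docs.items():
    tf = Counter(text.split())
    tfidf[cat] = {t: tf[t] * math.log10(N / df[t]) for t in tf}

# "game" occurs in only one category-document, so its idf is log10(3/1)
print(tfidf["sports"]["game"])
```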
A simple way of categorizing a new document is then to score the query against each category: for each term in the query, look up the tf*idf value that term received in that category, and sum these values. The category with the highest total score is ranked first.
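The summing approach can be sketched like this; the per-category term weights below are hypothetical stand-ins for tf*idf values computed during training:

```python
# Hypothetical per-category term weights (tf*idf values from training)
category_weights = {
    "sports": {"game": 0.95, "ball": 0.48, "win": 0.18},
    "politics": {"vote": 0.48, "election": 0.48, "win": 0.18},
}

def rank_categories(query, weights):
    """Score each category by summing the weights of the query's terms;
    terms unknown to a category contribute 0."""
    scores = {
        cat: sum(w.get(term, 0.0) for term in query.split())
        for cat, w in weights.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_categories("win the game", category_weights))
```

With these toy weights, "win the game" matches "sports" more strongly than "politics", so "sports" is ranked first.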
Another possibility is to create a vector for the query using the idf of each term in the query; all terms which don't occur in the query are given the value 0. The query vector can then be compared for similarity to each category vector using, for example, cosine similarity.
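Cosine similarity over sparse term-weight vectors is straightforward to implement directly; the example vectors here are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term->weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical idf-weighted query vector and category vectors
query_vec = {"win": 0.18, "game": 0.95}
sports_vec = {"game": 0.95, "ball": 0.48, "win": 0.18}
politics_vec = {"vote": 0.48, "election": 0.48, "win": 0.18}

print(cosine(query_vec, sports_vec), cosine(query_vec, politics_vec))
```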
Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
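One common smoothing variant (add-one smoothing of the document frequency; other variants exist) avoids a division by zero for terms that never occur in the corpus:

```python
import math

def smoothed_idf(N, df):
    """Add-one smoothed idf: unseen terms (df = 0) get a finite value
    instead of causing a division by zero."""
    return math.log10((N + 1) / (df + 1))

print(smoothed_idf(20, 0))   # unseen term: finite, no division by zero
print(smoothed_idf(20, 15))  # close to the unsmoothed log10(20/15)
```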
I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.