
Can TF-IDF take classes into account?

Using a classification algorithm (for example naive Bayes or SVM) together with StringToWordVector, would it be possible to use TF-IDF but count term frequency over the whole current class instead of just within a single document?

Let me explain: I would like the computation to give a high score to words that are very frequent in a given class (not just in a given document) but not very frequent in the whole corpus.

Is this possible out of the box, or does it need some extra development?

Thanks :)

Loic asked Oct 11 '13 15:10


People also ask

Can we use TF-IDF for classification?

It is possible to classify bodies of text by looking at the frequencies of words in the text. In this post we will look at doing just that. This tool can be used to classify emails as spam or ham, to classify news as real or fake, or a myriad of other things.

What are two limitations of the TF-IDF representation?

However, TF-IDF has several limitations: – It computes document similarity directly in the word-count space, which may be slow for large vocabularies. – It assumes that the counts of different words provide independent evidence of similarity. – It makes no use of semantic similarities between words.

What is class based TF-IDF?

c-TF-IDF is a class-based TF-IDF procedure that can be used to generate features from textual documents based on the class they are in. Typical applications: informative words per class (which words make a class stand out compared to all others?) and class reduction (using c-TF-IDF to reduce the number of classes).
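A minimal pure-Python sketch of the class-based idea: all documents of a class are merged into one bag of words, and the "document" frequency in the idf term is counted over classes rather than over individual documents. The corpus here is invented for illustration, and the exact formula used by real libraries (such as BERTopic) differs slightly.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_class):
    # Treat all documents of a class as one big document, then weight
    # each term by its within-class frequency times an idf computed
    # over classes rather than over individual documents.
    tf = {c: Counter(w for d in docs for w in d.split())
          for c, docs in docs_by_class.items()}
    n_classes = len(tf)
    # "Document" frequency at the class level: the number of classes
    # in which each term occurs at least once.
    df = Counter(t for counts in tf.values() for t in counts)
    scores = {}
    for c, counts in tf.items():
        total = sum(counts.values())
        scores[c] = {t: (n / total) * math.log(n_classes / df[t])
                     for t, n in counts.items()}
    return scores

# Tiny invented corpus: two classes, two documents each.
corpus = {
    "spam": ["win money now", "win a prize now"],
    "ham":  ["meeting at noon", "see you now at the meeting"],
}
s = c_tf_idf(corpus)
# "win" occurs only in the spam class, so it gets a positive score
# there; "now" occurs in both classes, so its class-level idf is
# log(2/2) = 0 and its score is zero.
```

This matches the asker's goal: a term frequent in one class but spread across the corpus is suppressed, while a class-specific term is boosted.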

What can TF-IDF be used for?

TF-IDF stands for term frequency-inverse document frequency. It is a measure, used in the fields of information retrieval (IR) and machine learning, that quantifies the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document amongst a collection of documents (also known as a ...
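For a concrete illustration, here is a minimal pure-Python TF-IDF computation over a made-up three-document corpus. It uses the plain log(N/df) variant of idf; real libraries typically add smoothing and normalisation.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "stocks fell on the heavy market",
]
tokenised = [d.split() for d in docs]
N = len(tokenised)
# Document frequency: in how many documents does each term occur?
df = Counter(t for doc in tokenised for t in set(doc))

def tf_idf(doc):
    counts = Counter(doc)
    return {t: (n / len(doc)) * math.log(N / df[t])
            for t, n in counts.items()}

weights = tf_idf(tokenised[0])
# "the" occurs in every document, so its idf is log(3/3) = 0 and its
# weight is zero; "sat" and "mat" are unique to this document and
# score highest.
```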


2 Answers

I would like the computation to give high score to words that are very frequent for a given class (not just for a given document) but not very frequent in the whole corpus.

You seem to want supervised term weighting. I'm not aware of any off-the-shelf implementation of that, but there's a host of literature about it. E.g. the weighting scheme tf-χ² replaces idf with the result of a χ² independence test, so terms that are statistically dependent on particular classes get boosted; several other such schemes exist.
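The χ² part of such a scheme can be sketched in plain Python: for each term, build the contingency table of term presence vs. class and compute the χ² statistic. This is a toy illustration on invented data with binary term presence; a real tf-χ² weight would multiply each term's tf by this statistic.

```python
import math

def chi2_weights(labelled_docs):
    # Supervised weighting: replace idf with a chi-squared statistic
    # measuring how strongly each term's presence depends on the class.
    N = len(labelled_docs)
    docs = [(set(text.split()), label) for text, label in labelled_docs]
    classes = {label for _, label in docs}
    terms = {t for words, _ in docs for t in words}
    weights = {}
    for t in terms:
        stat = 0.0
        for c in classes:
            for present in (True, False):
                observed = sum(1 for words, label in docs
                               if (t in words) == present and label == c)
                p_term = sum(1 for words, _ in docs
                             if (t in words) == present) / N
                p_class = sum(1 for _, label in docs if label == c) / N
                # Expected count if term presence were independent of class.
                expected = N * p_term * p_class
                if expected > 0:
                    stat += (observed - expected) ** 2 / expected
        weights[t] = stat
    return weights

# Invented labelled corpus for illustration.
data = [
    ("win money now", "spam"),
    ("win a prize", "spam"),
    ("meeting at noon", "ham"),
    ("see you at noon", "ham"),
]
w = chi2_weights(data)
# "win" perfectly separates spam from ham on this corpus (chi^2 = 4);
# "now" appears in only one document, so its dependence on the class
# label is weaker.
```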

Tf-idf itself is by its very nature unsupervised.

Fred Foo answered Oct 05 '22 14:10


I think you're confusing yourself here: what you're asking for is essentially the feature weight on that term for documents of that class. This is exactly what the learning algorithm is intended to optimise. Just worry about a useful representation of documents, which must necessarily be invariant to the class they belong to (since you won't know the class of unseen test documents).

Ben Allison answered Oct 05 '22 14:10