 

How do I do classification using TfidfVectorizer plus metadata in practice?

I am trying to classify some documents into two classes, using TfidfVectorizer as a feature extraction technique.

The input data consists of rows, each containing about a dozen float fields, a label, and a text blob with the body of the document. To use the body, I applied TfidfVectorizer and got a sparse matrix (which I can examine by converting to an array via toarray()). This matrix is usually very large, thousands by thousands of dimensions; let's call it F, with size 1000 x 15000.

To use a classifier in Scikit, I give it an input matrix X of shape (number of rows x number of features). If I do not use the body, I have maybe an X of size 1000 x 15.

Here is the problem: suppose I horizontally stack this F onto X, so X becomes 1000 x 15015. This introduces a few problems: 1) the first 15 features will now play very little role; 2) out-of-memory errors.
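To make the setup concrete, a minimal sketch of what I am doing (the data and shapes are made up for illustration, and load_bodies is a hypothetical helper):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    bodies = load_bodies()                 # hypothetical: list of 1000 text blobs
    X_meta = np.random.rand(1000, 15)      # stand-in for the float metadata fields

    F = TfidfVectorizer().fit_transform(bodies)   # sparse, about 1000 x 15000

    # densifying F just to stack it next to the metadata is what runs out of memory
    X = np.hstack([X_meta, F.toarray()])          # 1000 x 15015, fully dense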

Scikit provides an example that uses solely the TfidfVectorizer input, but it sheds no light on how to use it alongside the metadata.

My question is: how do you use the TfidfVectorizer output together with the metadata as input to a classifier for training?

Thank you.

asked Oct 19 '13 by log0


1 Answer

  1. Extract bag of words (tf-idf) features, call these X_tfidf.

  2. Extract metadata features, call these X_metadata.

  3. Stack them together:

    import scipy.sparse
    X = scipy.sparse.hstack([X_tfidf, X_metadata])
    
  4. If it doesn't work as expected, try re-normalizing (a complete sketch combining all four steps follows below):

    from sklearn.preprocessing import normalize
    X = normalize(X, copy=False)
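
Putting the four steps together, a minimal end-to-end sketch (bodies, metadata and y stand in for your own data, and LinearSVC is just one possible choice of estimator):

    import scipy.sparse
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.svm import LinearSVC

    # 1. tf-idf features from the document bodies
    vectorizer = TfidfVectorizer()
    X_tfidf = vectorizer.fit_transform(bodies)

    # 2. metadata features as a sparse matrix
    X_metadata = scipy.sparse.csr_matrix(metadata)

    # 3. stack the two blocks side by side without ever densifying
    X = scipy.sparse.hstack([X_tfidf, X_metadata]).tocsr()

    # 4. put every row back on the same scale (optional)
    X = normalize(X, copy=False)

    clf = LinearSVC().fit(X, y)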
    

If you use a linear estimator such as LinearSVC, LogisticRegression or SGDClassifier, you shouldn't worry about the role that features play in the classification; that is the estimator's job. Linear estimators assign each individual feature a weight that tells how informative the feature is, i.e. they figure this out for you.

(Non-parametric, distance/similarity-based models such as kernel SVMs or k-NN may have a harder time on such datasets.)
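For instance, after fitting you can inspect the learned weights yourself; a sketch reusing clf and vectorizer from the example above (get_feature_names_out needs scikit-learn >= 1.0):

    import numpy as np

    terms = vectorizer.get_feature_names_out()
    weights = clf.coef_.ravel()            # one weight per feature in the binary case

    # the tf-idf block comes first in the stacked matrix, the metadata block last
    top = np.argmax(np.abs(weights[:len(terms)]))
    print("most informative term:", terms[top], weights[top])
    print("metadata feature weights:", weights[len(terms):])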

answered Oct 12 '22 by Fred Foo