 

How do I do classification using TfidfVectorizer plus metadata in practice?

I am trying to classify some documents into two classes, using TfidfVectorizer as a feature extraction technique.

The input data consists of rows, each containing about a dozen float fields, a label, and a text blob with the body of the document. To use the body, I applied TfidfVectorizer and got a sparse matrix (which I can examine by converting to an array via toarray()). This matrix is usually very large, thousands by thousands of dimensions; let's call it F, with size 1000 x 15000.

To use a classifier in Scikit, I give it an input matrix X of shape (number of rows x number of features). If I do not use the body, I have maybe an X of size 1000 x 15.

Here is the problem: suppose I horizontally stack this F onto X, so X becomes 1000 x 15015. This introduces a few problems: 1) the first 15 features will now play very little role; 2) out-of-memory errors.
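To make the setup concrete, a minimal sketch of what I am doing (the data and shapes are made up for illustration, and load_bodies is a hypothetical helper):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    bodies = load_bodies()                 # hypothetical: list of 1000 text blobs
    X_meta = np.random.rand(1000, 15)      # stand-in for the float metadata fields

    F = TfidfVectorizer().fit_transform(bodies)   # sparse, about 1000 x 15000

    # densifying F just to stack it next to the metadata is what runs out of memory
    X = np.hstack([X_meta, F.toarray()])          # 1000 x 15015, fully dense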

Scikit provides an example that uses solely the TfidfVectorizer input, but it sheds no light on how to use it alongside the metadata.

My question is: how do you use the TfidfVectorizer output together with the metadata as input to a classifier for training?

Thank you.

asked Oct 19 '13 by log0


1 Answer

  1. Extract bag of words (tf-idf) features, call these X_tfidf.

  2. Extract metadata features, call these X_metadata.

  3. Stack them together:

    import scipy.sparse
    X = scipy.sparse.hstack([X_tfidf, X_metadata])
    
  4. If it doesn't work as expected, try re-normalizing (a complete sketch combining all four steps follows below):

    from sklearn.preprocessing import normalize
    X = normalize(X, copy=False)
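
Putting the four steps together, a minimal end-to-end sketch (bodies, metadata and y stand in for your own data, and LinearSVC is just one possible choice of estimator):

    import scipy.sparse
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.svm import LinearSVC

    # 1. tf-idf features from the document bodies
    vectorizer = TfidfVectorizer()
    X_tfidf = vectorizer.fit_transform(bodies)

    # 2. metadata features as a sparse matrix
    X_metadata = scipy.sparse.csr_matrix(metadata)

    # 3. stack the two blocks side by side without ever densifying
    X = scipy.sparse.hstack([X_tfidf, X_metadata]).tocsr()

    # 4. put every row back on the same scale (optional)
    X = normalize(X, copy=False)

    clf = LinearSVC().fit(X, y)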
    

If you use a linear estimator such as LinearSVC, LogisticRegression or SGDClassifier, you shouldn't worry about the role that features play in the classification; that is the estimator's job. Linear estimators assign each individual feature a weight that tells how informative the feature is, i.e. they figure this out for you.

(Non-parametric, distance/similarity-based models such as kernel SVMs or k-NN may have a harder time on such datasets.)
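For instance, after fitting you can inspect the learned weights yourself; a sketch reusing clf and vectorizer from the example above (get_feature_names_out needs scikit-learn >= 1.0):

    import numpy as np

    terms = vectorizer.get_feature_names_out()
    weights = clf.coef_.ravel()            # one weight per feature in the binary case

    # the tf-idf block comes first in the stacked matrix, the metadata block last
    top = np.argmax(np.abs(weights[:len(terms)]))
    print("most informative term:", terms[top], weights[top])
    print("metadata feature weights:", weights[len(terms):])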

answered Oct 12 '22 by Fred Foo