I am trying to classify some documents into two classes, using TfidfVectorizer as a feature extraction technique.
The input data consists of rows, each containing about a dozen fields of float data, a label, and the text blob of the body of the document. To use the body, I applied TfidfVectorizer and got a sparse matrix (which I can examine by converting to an array via toarray()). This matrix is usually very large, thousands by thousands of dimensions; let's call it F, with size 1000 x 15000.
To use a classifier in scikit-learn, I give it an input matrix X of shape (number of rows x number of features). If I do not use the body, I have maybe an X of size 1000 x 15.
Here is the problem: suppose I horizontally stack F onto X, so X becomes 1000 x 15015. This introduces a few problems: 1) the first 15 features will now play a very small role; 2) out-of-memory errors.
scikit-learn provides an example that uses only the TfidfVectorizer output, but it sheds no light on how to use that output alongside the metadata.
My question is: how do you use the TfidfVectorizer output along with the metadata to fit a classifier for training?
Thank you.
Extract bag of words (tf-idf) features, call these X_tfidf.

Extract metadata features, call these X_metadata.

Stack them together:

import scipy.sparse
X = scipy.sparse.hstack([X_tfidf, X_metadata])
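For concreteness, here is a minimal end-to-end sketch of that recipe. The names docs, meta and labels, and the choice of LogisticRegression, are illustrative assumptions rather than anything from the question:

import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed inputs: docs is a list of document bodies (strings),
# meta is an (n_docs, 15) array of float metadata, labels are the classes.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)      # sparse, n_docs x n_terms

X_metadata = scipy.sparse.csr_matrix(meta)    # keep the metadata sparse too

# hstack keeps the combined matrix sparse, so the 1000 x 15015 matrix
# is never densified -- this is what avoids the out-of-memory problem.
X = scipy.sparse.hstack([X_tfidf, X_metadata]).tocsr()

clf = LogisticRegression().fit(X, labels)

Converting to CSR after stacking is optional, but most scikit-learn estimators convert to CSR internally anyway, so doing it once up front avoids repeated work.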
If it doesn't work as expected, try re-normalizing:
from sklearn.preprocessing import normalize
X = normalize(X, copy=False)
If you use a linear estimator such as LinearSVC, LogisticRegression or SGDClassifier, you shouldn't worry about the role that features play in the classification; that is the estimator's job. Linear estimators assign a weight to each individual feature that tells how informative the feature is, i.e. they figure this out for you.
(Non-parametric, distance/similarity-based models such as kernel SVMs or k-NN may have a harder time on such datasets.)
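As a quick illustration of that last point, you can inspect the learned per-feature weights after fitting. This assumes the clf and the stacking order (tf-idf columns first, 15 metadata columns last) from the sketch above:

import numpy as np

# For a binary problem, coef_ has shape (1, n_features); flatten it.
weights = clf.coef_.ravel()

# Weights assigned to the 15 metadata features (the last columns of X).
print(weights[-15:])

# Indices of the ten most informative features by absolute weight.
print(np.argsort(np.abs(weights))[::-1][:10])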