I am working on a text classification problem using scikit-learn classifiers and its text feature extractors, in particular the TfidfVectorizer class.
The problem is that I have two kinds of features: the first are the n-grams captured by TfidfVectorizer, and the others are domain-specific features that I extract from each document. I need to combine both in a single feature vector per document. To do this, I need to extend the scipy sparse matrix returned by TfidfVectorizer by adding a new column that holds the domain feature for each document. However, I can't find a neat way to do this. By neat I mean without converting the sparse matrix into a dense one, since the dense matrix simply won't fit in memory.
I am probably missing a feature in scikit-learn, since I am new to both scipy and scikit-learn.
I think the easiest way would be to create a new sparse matrix with your custom features and then use scipy.sparse.hstack to stack the features.
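A minimal sketch of the stacking approach, assuming a single numeric domain feature per document. The toy documents and the token-count feature are made up for illustration; substitute your own feature extraction.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices fell sharply today",
]

# Hypothetical domain feature: number of whitespace-separated tokens per document.
# Shape is (n_docs, 1); more columns can be added the same way.
domain = np.array([[len(d.split())] for d in docs], dtype=np.float64)

# Sparse tf-idf matrix over unigrams and bigrams, shape (n_docs, n_ngrams).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(docs)

# Wrap the extra column in a sparse matrix and append it on the right.
# The tf-idf matrix is never converted to a dense array.
X = hstack([X_tfidf, csr_matrix(domain)], format="csr")

print(X_tfidf.shape, X.shape)
```

The result has one more column than the tf-idf matrix and can be passed to any scikit-learn classifier that accepts sparse input. Note that TfidfVectorizer normalizes each row to unit length by default, so a domain feature on a much larger scale may need rescaling before stacking.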
You might also find FeatureUnion from the sklearn.pipeline module helpful.
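A sketch of how FeatureUnion could be used here, assuming the domain features can be computed by a plain function of the raw documents. The token_counts function is a hypothetical stand-in, and wrapping it in FunctionTransformer is one possible choice, not the only one.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def token_counts(docs):
    """Hypothetical domain feature: one column holding the token count per document."""
    return np.array([[len(d.split())] for d in docs], dtype=np.float64)


# Each transformer receives the same raw documents; the outputs are
# concatenated column-wise. Because the tf-idf output is sparse, the
# combined matrix stays sparse.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("domain", FunctionTransformer(token_counts, validate=False)),
])

# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices fell sharply today",
]

X = features.fit_transform(docs)
print(X.shape)
```

The advantage over calling hstack by hand is that the union behaves like a single transformer, so it can be placed in a Pipeline in front of a classifier and applied consistently to training and test data.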