
How to extend the SciPy sparse matrix returned by sklearn TfidfVectorizer to hold more features

I am working on a text classification problem using scikit-learn classifiers and text feature extractors, particularly the TfidfVectorizer class.

The problem is that I have two kinds of features: the n-grams captured by TfidfVectorizer, and domain-specific features that I extract from each document. I need to combine both into a single feature vector per document. To do this, I need to extend the scipy sparse matrix returned by TfidfVectorizer by adding a new column that holds the domain feature for each document. However, I can't find a neat way to do this. By neat, I mean without converting the sparse matrix into a dense one, since that simply won't fit in memory.

Probably I am missing a feature in scikit-learn or something, since I am new to both scipy and scikit-learn.

iBrAaAa asked Apr 10 '13 23:04


1 Answer

I think the easiest approach would be to create a new sparse matrix holding your custom features and then use scipy.sparse.hstack to stack it next to the TfidfVectorizer output. You might also find FeatureUnion from the sklearn.pipeline module helpful.
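A minimal sketch of both suggestions, in case it helps. The toy corpus and the `domain_features` helper (here just the character count of each document) are hypothetical stand-ins for your real data and domain feature. Wrapping the helper in `FunctionTransformer` is one possible way to plug it into `FeatureUnion`, not the only one.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the quick brown fox",
    "jumped over the lazy dog",
    "the dog barked",
]


def domain_features(texts):
    """Hypothetical domain feature: character count of each document.

    Returns a 2D array of shape (n_docs, 1) so it can be stacked column-wise.
    """
    return np.array([[len(t)] for t in texts], dtype=np.float64)


# Option 1: stack manually with scipy.sparse.hstack.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(docs)       # sparse, (n_docs, n_ngrams)
X_domain = csr_matrix(domain_features(docs))   # sparse, (n_docs, 1)
# The result stays sparse; the n-gram matrix is never densified.
X_combined = hstack([X_tfidf, X_domain], format="csr")

# Option 2: let FeatureUnion run both extractors and do the stacking.
union = FeatureUnion([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("domain", FunctionTransformer(domain_features)),
])
X_union = union.fit_transform(docs)

print(X_combined.shape, X_union.shape)
```

With the manual version you have to remember to apply the same stacking to the test set (using `vectorizer.transform`), whereas the `FeatureUnion` object can be dropped into a `Pipeline` in front of the classifier and handles that for you.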

Andreas Mueller answered Sep 28 '22 10:09