How to Extend Scipy Sparse Matrix returned by sklearn TfidfVectorizer to hold more features

Question

I am working on a text classification problem using scikit-learn classifiers and text feature extractors, particularly the TfidfVectorizer class.

The problem is that I have two kinds of features: n-gram features produced by TfidfVectorizer, and domain-specific features that I extract from each document myself. I need to combine both into a single feature vector per document. To do this, I need to extend the scipy sparse matrix returned by TfidfVectorizer by adding a new column that holds the domain feature for each document. However, I can't find a neat way to do this. By neat I mean without converting the sparse matrix into a dense one, since the dense matrix won't fit in memory.

Probably I am missing a feature in scikit-learn or something, since I am new to both scipy and scikit-learn.

Andreas Mueller · Accepted Answer

I think the easiest approach would be to create a new sparse matrix with your custom features and then use scipy.sparse.hstack to stack it next to the TF-IDF matrix. You might also find FeatureUnion from the sklearn.pipeline module helpful.
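A minimal sketch of the hstack approach, assuming a toy corpus and a hypothetical domain feature (token count per document) that are not from the original question. The names `docs`, `doc_lengths`, `X_domain`, and `X_combined` are illustrative only:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption: stands in for the real documents).
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the quick brown fox",
]

# Hypothetical domain feature: number of whitespace-separated tokens.
# One row per document, one column per domain feature.
doc_lengths = np.array([[len(d.split())] for d in docs], dtype=np.float64)

# TF-IDF features come back as a sparse matrix of shape (n_docs, n_terms).
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)

# Wrap the domain features in a sparse matrix so nothing is densified.
X_domain = sparse.csr_matrix(doc_lengths)

# Stack column-wise; the result stays sparse and gains one extra column.
X_combined = sparse.hstack([X_tfidf, X_domain], format="csr")

print(X_tfidf.shape, X_combined.shape)
```

The combined matrix can be passed to a classifier's `fit` in place of the plain TF-IDF matrix. At prediction time the same steps apply, with `vectorizer.transform` instead of `fit_transform`. FeatureUnion performs the same column-wise stacking, but it expects each feature source to be wrapped as a transformer.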

Tags: python-2.7, scipy, scikit-learn, sparse-matrix

iBrAaAa
