 

What is the correct way to mix feature sparse matrices with sklearn?

The other day I was working on a machine learning task that required extracting several types of feature matrices. I saved these feature matrices to disk as NumPy arrays so I could later feed them to an estimator (this was a classification task). When I wanted to use all the features, I simply concatenated the matrices into one big feature matrix and passed that matrix to the estimator.

I do not know whether this is the correct way to work with a feature matrix that holds many different kinds of patterns (counts). What other approaches should I use to mix several types of features correctly? Looking through the documentation, I found FeatureUnion, which seems designed for this task.

For example, let's say I would like to build one big feature matrix from three vectorizers: TfidfVectorizer, CountVectorizer, and HashingVectorizer. This is what I tried, following the documentation example:

#Read the .csv file
import pandas as pd
df = pd.read_csv('file.csv',
                     header=0, sep=',', names=['id', 'text', 'labels'])

#vectorizer 1
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2,2))
#vectorizer 2
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(2,2))

#vectorizer 3
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(ngram_range=(2,2))


#Combine the above vectorizers in one single feature matrix:

from sklearn.pipeline import  FeatureUnion
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash",hash_vect)])

X_combined_features = combined_features.fit_transform(df['text'].values)
y = df['labels'].values

#Check the matrix
print(X_combined_features.toarray())

Then:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

Split the data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_combined_features, y, test_size=0.33)

So I have a few questions: Is this the right approach to mixing several feature extractors in order to yield one big feature matrix? And, assuming I create my own "vectorizers" that return sparse matrices, how can I correctly use the FeatureUnion interface to mix them with the three vectorizers above?

Update

Let's say that I have a matrix like this:

Matrix A (152, 33)

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

Then, with my own vectorizer that returns a NumPy array, I get this feature matrix:

Matrix B (152, 10)

[[4210  228   25 ...,    0    0    0]
 [4490  180   96 ...,   10    4    6]
 [4795  139    8 ...,    0    0    1]
 ..., 
 [1475   58    3 ...,    0    0    0]
 [4668  256   25 ...,    0    0    0]
 [1955  111   10 ...,    0    0    0]]

Matrix C (152, 46)

[[ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0 17]
 [ 0  0  0 ...,  0  0  0]
 ..., 
 [ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]]

How can I merge A, B, and C correctly with numpy.hstack, scipy.sparse.hstack, or FeatureUnion? Do you think this is a correct pipeline approach to follow for a machine learning task?

tumbleweed asked Aug 29 '15



1 Answer

Is this the right approach to mix several feature extractors in order to yield a big feature matrix?

In terms of correctness of the result, your approach is right, since FeatureUnion runs each individual transformer on the input data and concatenates the resulting matrices horizontally. However, it's not the only way, and which way is better in terms of efficiency will depend on your use case (more on this later).

Assume I create my own "vectorizers" and they return sparse matrices, how can I use correctly the FeatureUnion interface to mix them with the above 3 features?

Using FeatureUnion, you simply need to append your new transformer to the transformer list:

custom_vect = YourCustomVectorizer()
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect),
                                  ("custom", custom_vect)])
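For a concrete idea of what such a custom vectorizer can look like, here is a minimal sketch: FeatureUnion only requires an object with fit and transform (fit_transform comes free from TransformerMixin), and transform should return a 2-D array or scipy sparse matrix. The document-length feature used here is just an illustrative choice, not something from the question.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class LengthVectorizer(BaseEstimator, TransformerMixin):
    """Toy custom vectorizer: one feature per document (its length)."""

    def fit(self, raw_documents, y=None):
        return self  # stateless: nothing to learn

    def transform(self, raw_documents):
        lengths = np.array([[len(doc)] for doc in raw_documents], dtype=float)
        # Returning a sparse matrix lets FeatureUnion hstack it with the
        # sparse outputs of the other vectorizers without densifying.
        return sparse.csr_matrix(lengths)
```

Because it subclasses BaseEstimator, it also plays well with get_params/set_params, so it can be tuned inside a Pipeline or GridSearchCV like any built-in transformer.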

However, if your input data and most of the transformers are fixed (e.g. when you're experimenting with the inclusion of a new transformer), the above approach leads to a lot of recomputation. In that case, an alternative is to pre-compute and store the intermediate results of the transformers (dense or sparse matrices), and concatenate them manually using numpy.hstack or scipy.sparse.hstack when needed.
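The manual route can be sketched as follows. Mixed dense and sparse blocks go through scipy.sparse.hstack (numpy.hstack would densify everything); the shapes mirror the A/B/C matrices from the question, though the random contents here are just placeholders.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the pre-computed feature matrices from the question.
A = sparse.random(152, 33, density=0.01, format='csr')  # sparse block
B = np.random.rand(152, 10)                             # dense block
C = sparse.random(152, 46, density=0.01, format='csr')  # sparse block

# Wrap the dense block so the whole concatenation stays sparse.
X = sparse.hstack([A, sparse.csr_matrix(B), C]).tocsr()
print(X.shape)  # (152, 89): same rows, columns summed across blocks
```

All blocks must share the same number of rows (one row per sample); only the column counts may differ.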

If your input data is always changing but the list of transformers is fixed, FeatureUnion is more convenient. Another advantage is its n_jobs option, which lets you parallelize the fitting process.


Side note: it seems a little strange to mix the hashing vectorizer with the other vectorizers, since the hashing vectorizer is typically used when you cannot afford the exact, dictionary-based versions.

YS-L answered Oct 12 '22