The other day I was working on a machine learning task that required extracting several types of feature matrices. I saved these feature matrices to disk as numpy arrays so I could later feed them to an estimator (this was a classification task). When I wanted to use all the features, I simply concatenated the matrices into one big feature matrix and then presented that big matrix to an estimator.
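To make that concrete, here is roughly what I was doing, with placeholder file names and random matrices standing in for my real feature matrices:
import numpy as np

# Placeholder feature matrices standing in for the ones I actually extract
features_a = np.random.rand(152, 33)
features_b = np.random.rand(152, 10)

# Save each feature matrix to disk after extraction
np.save('features_a.npy', features_a)
np.save('features_b.npy', features_b)

# Later: load them back and concatenate column-wise into one big feature matrix
A = np.load('features_a.npy')
B = np.load('features_b.npy')
X_big = np.hstack([A, B])   # shape (152, 43)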
I do not know if this is the correct way to work with a feature matrix that has a lot of patterns (counts) in it, or what other approaches I should use to mix several types of features correctly. However, looking through the documentation I found FeatureUnion, which seems to do this task.
For example, let's say I would like to create a big feature matrix from 3 vectorizer approaches: TfidfVectorizer, CountVectorizer and HashingVectorizer.
This is what I tried following the documentation example:
# Read the .csv file
import pandas as pd
df = pd.read_csv('file.csv',
                 header=0, sep=',', names=['id', 'text', 'labels'])

# Vectorizer 1
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2, 2))

# Vectorizer 2
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(2, 2))

# Vectorizer 3
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(ngram_range=(2, 2))

# Combine the above vectorizers into one single feature matrix
from sklearn.pipeline import FeatureUnion
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect)])

X_combined_features = combined_features.fit_transform(df['text'].values)
y = df['labels'].values

# Check the matrix
print(X_combined_features.toarray())
Then:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
Split the data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_combined_features, y, test_size=0.33)
So I have a few questions: Is this the right approach for mixing several feature extractors in order to yield one big feature matrix? And assuming I create my own "vectorizers" that return sparse matrices, how can I correctly use the FeatureUnion interface to mix them with the above 3 features?
Update:
Let's say that I have a matrix like this:
Matrix A (152, 33):
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
Then, with my own vectorizer that returns a numpy array, I get this feature matrix:
Matrix B (152, 10):
[[4210 228 25 ..., 0 0 0]
[4490 180 96 ..., 10 4 6]
[4795 139 8 ..., 0 0 1]
...,
[1475 58 3 ..., 0 0 0]
[4668 256 25 ..., 0 0 0]
[1955 111 10 ..., 0 0 0]]
Matrix C (152, 46):
[[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 17]
[ 0 0 0 ..., 0 0 0]
...,
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]]
How can I merge A, B and C correctly with numpy.hstack, scipy.sparse.hstack or FeatureUnion? And do you think this is a correct pipeline approach to follow for any machine learning task?
Scikit-learn has many estimators that accept sparse matrices. The way to know is to check the documentation of the estimator's fit method and look for this: X : {array-like, sparse matrix}.
Is this the right approach to mix several feature extractors in order to yield a big feature matrix?
In terms of correctness of the result, your approach is right, since FeatureUnion runs each individual transformer on the input data and concatenates the resulting matrices horizontally. However, it's not the only way, and which way is better in terms of efficiency will depend on your use case (more on this later).
Assuming I create my own "vectorizers" that return sparse matrices, how can I correctly use the FeatureUnion interface to mix them with the above 3 features?
Using FeatureUnion, you simply need to append your new transformer to the transformer list:
custom_vect = YourCustomVectorizer()
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect),
                                  ("custom", custom_vect)])
However, if your input data and most of the transformers are fixed (e.g. when you're experimenting with the inclusion of a new transformer), the above approach will lead to a lot of re-computation. In that case, an alternative approach is to pre-compute and store the intermediate results of the transformers (dense or sparse matrices), and concatenate them manually using numpy.hstack or scipy.sparse.hstack when needed.
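Applied to the matrices A, B and C from the update, that manual route could look like the sketch below (random placeholder data of the reported shapes, with A assumed to be sparse; adjust to however your matrices are actually stored):
import numpy as np
from scipy import sparse

# Placeholder matrices with the shapes from the update
A = sparse.random(152, 33, density=0.01, format='csr')  # e.g. a sparse vectorizer output
B = np.random.randint(0, 5000, size=(152, 10))           # a dense numpy array
C = np.random.randint(0, 20, size=(152, 46))             # another dense numpy array

# If every matrix is dense, numpy.hstack([B, C]) is enough. As soon as one of
# them is sparse, use scipy.sparse.hstack, which accepts a mix of sparse and
# dense inputs and keeps the combined result sparse:
X = sparse.hstack([A, B, C]).tocsr()
print(X.shape)  # (152, 89)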
If your input data is always changing but the list of transformers is fixed, FeatureUnion offers more convenience. Another advantage is its n_jobs option, which lets you parallelize the fitting process.
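For instance, the union from the question could be built with n_jobs set so that the three vectorizers are fitted in parallel (the value 3 here is just an illustration):
from sklearn.feature_extraction.text import (TfidfVectorizer, CountVectorizer,
                                              HashingVectorizer)
from sklearn.pipeline import FeatureUnion

# One worker per transformer; n_jobs=-1 would use all available cores
combined_features = FeatureUnion([("tfidf_vect", TfidfVectorizer(ngram_range=(2, 2))),
                                  ("bow", CountVectorizer(ngram_range=(2, 2))),
                                  ("hash", HashingVectorizer(ngram_range=(2, 2)))],
                                 n_jobs=3)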
Side note: it seems a little bit strange to mix the hashing vectorizer with the other vectorizers, since the hashing vectorizer is typically used when you cannot afford to use the exact versions (e.g. because the vocabulary is too large to hold in memory).