How to properly combine numerical and text features with FeatureUnion in Python sklearn

I'm trying to use FeatureUnion for the first time in an sklearn pipeline to combine numerical features (2 columns) and text features (1 column) for multi-class classification.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1','num2']], validate=False)

process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ]))
         ]
    )

In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.

The error message is

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
 steps=[('selector', FunctionTransformer(accept_sparse=False,
      func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
      inverse_func=None, kw_args=None, pass_y='deprecated',
      validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
      verbose=0, warm_start=False),
      n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

Is there a step I missed?

santoku asked Dec 11 '17 01:12


2 Answers

A FeatureUnion should be used as a step inside the pipeline, not wrapped around it. You get this error because each inner pipeline ends in a classifier instead of the classifier being the final step overall: the union calls fit and transform on all of its transformers, and a classifier does not have a transform method.
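A quick way to see this for yourself is to check which pipelines expose transform at all. This is only a small sketch; the names clf_pipe and vec_pipe are made up for illustration and are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical pipeline that ends in a classifier, like the ones in the question
clf_pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Hypothetical pipeline that ends in a transformer
vec_pipe = Pipeline([("vec", CountVectorizer())])

# A Pipeline only exposes transform when its final step has one
print(hasattr(clf_pipe, "transform"))
print(hasattr(vec_pipe, "transform"))
```

The first check should come out False and the second True, which is why FeatureUnion accepts the second kind of pipeline but rejects the first.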

Simply rework to have an outer pipeline with the classifier as the final step:

process_and_join_features = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data)
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer())
            ]))
         ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

The scikit-learn website also has a good example of this sort of thing.
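To try the reworked pipeline end to end, here is a minimal sketch on made-up data. The DataFrame contents and labels are invented, and the .to_numpy() call in the numeric selector is my own addition (an assumption, not part of the question) so the numeric block is a plain array before it is stacked with the sparse text features.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Made-up toy data: one text column, two numeric columns, three classes
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.9, 1.0, 0.4, 0.5],
})
y = [0, 0, 1, 1, 2, 2]

get_text_data = FunctionTransformer(lambda x: x["text"], validate=False)
# Assumption for this sketch: hand the union a NumPy array, not a DataFrame
get_numeric_data = FunctionTransformer(
    lambda x: x[["num1", "num2"]].to_numpy(), validate=False
)

pipe = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([("selector", get_numeric_data)])),
        ("text_features", Pipeline([
            ("selector", get_text_data),
            ("vec", CountVectorizer()),
        ])),
    ])),
    # The classifier is the final step of the outer pipeline
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

pipe.fit(df, y)

# Combined feature matrix: 2 numeric columns plus one column per token
features = pipe.named_steps["features"].transform(df)
preds = pipe.predict(df)
print(features.shape)
print(preds)
```

With this toy data the union yields 11 columns (2 numeric plus 9 distinct tokens). The numeric columns are passed through unscaled here; whether to add a scaler before the classifier depends on your data.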

Ken Syme answered Oct 15 '22 16:10


I believe @Ken Syme correctly identified the problem and provided a fix for what you intend to do. However, in case you actually intend to use the output of the classifier as a feature for a higher-level model, check out this blog.

Using the ModelTransformer by Zac, you can have your pipe as follows:

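from pandas import DataFrame
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    # Wraps a predictor so its predictions can be used as features

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        # Fit the wrapped model; return self so the step can be chained
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # One column holding the wrapped model's predictions
        return DataFrame(self.model.predict(X))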


process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ]))
         ]
)

Depending on your concrete next steps, you may still have to wrap the FeatureUnion in a Pipeline (e.g. using the shortcut make_pipeline).
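As a rough sketch of that last step, the union of model predictions can be fed into a final estimator via make_pipeline. Everything here is illustrative: the toy data is made up, the final LogisticRegression is just one possible higher-level model, and this variant of ModelTransformer additionally inherits from BaseEstimator, which is my assumption for compatibility with newer scikit-learn releases and not part of the blog's version.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer


class ModelTransformer(TransformerMixin, BaseEstimator):
    """Expose a fitted model's predictions as a single feature column."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None, **fit_params):
        self.model.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        return pd.DataFrame(self.model.predict(X))


# Made-up toy data with integer labels, so predictions are numeric features
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.9, 1.0, 0.4, 0.5],
})
y = [0, 0, 1, 1, 2, 2]

get_text_data = FunctionTransformer(lambda x: x["text"], validate=False)
get_numeric_data = FunctionTransformer(
    lambda x: x[["num1", "num2"]], validate=False
)

union = FeatureUnion([
    ("numeric_features", Pipeline([
        ("selector", get_numeric_data),
        ("clf", ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
    ("text_features", Pipeline([
        ("selector", get_text_data),
        ("vec", CountVectorizer()),
        ("clf", ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
])

# Each branch contributes one column of predicted labels
stacked = union.fit_transform(df, y)
print(stacked.shape)

# Hypothetical higher-level model trained on the two prediction columns
model = make_pipeline(union, LogisticRegression())
model.fit(df, y)
preds = model.predict(df)
print(preds)
```

Note that the branch models here predict on the same rows they were trained on, so the stacked features will look better than they would on unseen data; out-of-fold predictions are the usual remedy if that matters for your use case.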

Marcus V. answered Oct 15 '22 16:10