I'm trying to use FeatureUnion for the first time in a scikit-learn pipeline to combine numerical features (2 columns) and a text feature (1 column) for multi-class classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Column selectors: one for the text column, one for the two numeric columns.
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1', 'num2']], validate=False)

process_and_join_features = FeatureUnion(
    [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer()),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ]))
    ]
)
In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.
The error message is
TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
steps=[('selector', FunctionTransformer(accept_sparse=False,
func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
inverse_func=None, kw_args=None, pass_y='deprecated',
validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't
Any step I missed?
scikit-learn provides utilities for the most common ways to extract numerical features from text content, for example tokenizing strings and assigning an integer id to each possible token, for instance by using whitespace and punctuation as token separators.
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently.
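As a rough illustration of that description, the sketch below fits two transformers side by side on the same input and shows that their outputs are concatenated column-wise. The toy array and the step names "scaled" and "log" are made up for this example and are not part of the question.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up toy data: 4 samples, 2 strictly positive numeric columns.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Each transformer sees the full input X and is fit independently.
union = FeatureUnion([
    ("scaled", StandardScaler()),          # contributes 2 columns
    ("log", FunctionTransformer(np.log)),  # contributes 2 columns
])

# The outputs are stacked horizontally: 2 + 2 = 4 columns.
combined = union.fit_transform(X)
print(combined.shape)  # (4, 4)
```

Every member of the union has to offer both fit and transform, because the union calls them on each member in turn.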
A FeatureUnion should be used as a step in the pipeline, not around the pipeline. The error you are getting is because you have a classifier that is not the final step: the union tries to call fit and transform on all of its transformers, and a classifier does not have a transform method.
Simply rework to have an outer pipeline with the classifier as the final step:
process_and_join_features = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data)
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer())
        ]))
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
Also see here for a good example on the scikit-learn website doing this sort of thing.
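For completeness, here is a minimal end-to-end sketch of the reworked pipeline being fit and used for prediction. The DataFrame contents and labels are invented for illustration; only the column names 'text', 'num1' and 'num2' come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data; the values are made up.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
})
y = ["fruit", "fruit", "vehicle", "vehicle", "weather", "weather"]

get_text_data = FunctionTransformer(lambda x: x["text"], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[["num1", "num2"]], validate=False)

# Feature extraction happens inside the union; the classifier is the last step.
model = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([("selector", get_numeric_data)])),
        ("text_features", Pipeline([
            ("selector", get_text_data),
            ("vec", CountVectorizer()),
        ])),
    ])),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

model.fit(df, y)
predictions = model.predict(df)  # one predicted label per row
print(predictions)
```

Because the classifier sits in the outer pipeline, the union only ever contains transformers, which is what avoids the TypeError.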
I believe @Ken Syme correctly identified the problem and provided a fix for what you intend to do. However, just in case you actually intend to use the output of the classifier as a feature for a higher-level model, check out this blog.
Using the ModelTransformer by Zac, you can have your pipe as follows:
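from pandas import DataFrame
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # Expose the wrapped model's predictions as a one-column feature.
        return DataFrame(self.model.predict(X))
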
process_and_join_features = FeatureUnion(
    [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer()),
            ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
        ]))
    ]
)
Depending on your next steps, you may still have to wrap the FeatureUnion in a Pipeline (e.g. using the shortcut make_pipeline).
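One way that wrapping could look is sketched below. Note the assumptions: the ModelTransformer here is an adapted variant, not the blog's original (it also inherits from BaseEstimator and stores a fitted clone in model_), the toy data is invented, and integer labels are used so that the first-stage predictions are numeric and can serve as input features for the final classifier.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer


class ModelTransformer(TransformerMixin, BaseEstimator):
    """Adapted variant (an assumption of this sketch): emit predictions as a feature."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None, **fit_params):
        # Fit a clone; the trailing underscore marks fitted state by convention.
        self.model_ = clone(self.model).fit(X, y, **fit_params)
        return self

    def transform(self, X):
        return pd.DataFrame(self.model_.predict(X))


# Hypothetical toy data with integer class labels.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
})
y = [0, 0, 1, 1, 2, 2]

get_text_data = FunctionTransformer(lambda x: x["text"], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[["num1", "num2"]], validate=False)

process_and_join_features = FeatureUnion([
    ("numeric_features", Pipeline([
        ("selector", get_numeric_data),
        ("clf", ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
    ("text_features", Pipeline([
        ("selector", get_text_data),
        ("vec", CountVectorizer()),
        ("clf", ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
])

# The higher-level model consumes one prediction column per first-stage model.
stacked_model = make_pipeline(process_and_join_features, LogisticRegression())
stacked_model.fit(df, y)

stacked_features = stacked_model[0].transform(df)  # shape: (n_samples, 2)
predictions = stacked_model.predict(df)
print(stacked_features.shape, predictions)
```

In this sketch the first-stage models are fit and evaluated on the same rows, so the stacked features are optimistic; out-of-fold predictions would be a more careful choice for real data.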