I'm building a DataFrame pre-processing pipeline with sklearn by chaining several kinds of pre-processing steps.
I want to chain a SimpleImputer and a FunctionTransformer that applies pd.qcut (or pd.cut), but I keep getting the following error:
ValueError: Input array must be 1 dimensional
Here's my code:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        # Stored under the parameter's own name so get_params() and clone() work.
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.features]
fare_transformer = Pipeline([
    ('fare_selector', FeatureSelector(['Fare'])),
    ('fare_imputer', SimpleImputer(strategy='median')),
    ('fare_bands', FunctionTransformer(func=pd.qcut, kw_args={'q': 5}))
])
The same happens if I simply chain the FeatureSelector transformer and the FunctionTransformer with pd.qcut and omit the SimpleImputer:
fare_transformer = Pipeline([
    ('fare_selector', FeatureSelector(['Fare'])),
    ('fare_bands', FunctionTransformer(func=pd.qcut, kw_args={'q': 5}))
])
I searched Stack Overflow and Google extensively but could not find a solution to this issue. Any help here would be greatly appreciated!
sklearn already has such a transformer: KBinsDiscretizer (use strategy='quantile' to match pd.qcut). The main difference is in how test data is transformed: the FunctionTransformer version will "refit" the quantiles on whatever data it is given, whereas the built-in KBinsDiscretizer saves the quantile statistics learned during fit and reuses them to bin test data. As @m_power notes in a comment, the two also differ near bin edges and in the format of the transformed data.
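A minimal sketch of the KBinsDiscretizer route, assuming made-up fare values and ordinal output (encode='ordinal' is my choice here; the default is one-hot). The names fare_binner, train and test are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Made-up fares for illustration; the NaN stands in for a missing value.
train = pd.DataFrame({'Fare': [float(v) for v in range(100)]})
train.loc[3, 'Fare'] = np.nan
test = pd.DataFrame({'Fare': [5.0, 95.0, 1000.0]})

fare_binner = Pipeline([
    ('fare_imputer', SimpleImputer(strategy='median')),
    ('fare_bands', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')),
])

# Quantile edges are learned from the training data only...
train_bands = fare_binner.fit_transform(train)
# ...and reused unchanged for the test data.
test_bands = fare_binner.transform(test)

# The learned edges for the single 'Fare' column (n_bins + 1 values).
edges = fare_binner.named_steps['fare_bands'].bin_edges_[0]
```

Values outside the range seen during fit (such as the 1000.0 above) fall into the first or last bin rather than producing NaN.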
But to address the error specifically: pd.qcut only accepts a 1D array, whereas FunctionTransformer passes it the entire DataFrame. You can define a thin wrapper around qcut that applies it column by column, like
def frame_qcut(X, y=None, q=10):
    return X.apply(pd.qcut, axis=0, q=q)
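(That's assuming you'll get a DataFrame in. SimpleImputer returns a NumPy array by default, so the wrapper fits your second pipeline as written; in the first one, the imputer's output would have to be turned back into a DataFrame before calling apply.)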
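If the wrapper has to sit after SimpleImputer, one option is to rebuild a DataFrame inside it. The sketch below is an assumption-laden variant rather than the only way to do this: array_qcut is a hypothetical helper name, the fares are made up, and labels=False is my choice so the output is integer bin codes rather than Interval categories.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def array_qcut(X, q=10):
    # Hypothetical helper: accepts either a DataFrame or a 2D NumPy array.
    frame = pd.DataFrame(X)
    # pd.qcut is 1D-only, so apply it to each column separately.
    # labels=False returns integer bin codes instead of Interval categories.
    return frame.apply(lambda col: pd.qcut(col, q=q, labels=False))

# Made-up fares for illustration; the NaN stands in for a missing value.
fares = pd.DataFrame({'Fare': [float(v) for v in range(100)]})
fares.loc[3, 'Fare'] = np.nan

fare_transformer = Pipeline([
    ('fare_imputer', SimpleImputer(strategy='median')),
    ('fare_bands', FunctionTransformer(func=array_qcut, kw_args={'q': 5})),
])

fare_bands = fare_transformer.fit_transform(fares)
```

As with the original wrapper, the quantiles are recomputed on every call to transform, so training and test data will generally be binned with different edges.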