sklearn function transformer in pipeline

While writing my first scikit-learn pipeline, I ran into an issue when only a subset of columns is passed into the pipeline:

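import pandas as pd

mydf = pd.DataFrame({'classLabel':[0,0,0,1,1,0,0,0],
                   'categorical':[7,8,9,5,7,5,6,4],
                   'numeric1':[7,8,9,5,7,5,6,4],
                   'numeric2':[7,8,9,5,7,5,6,"N.A"]})
# Split the frame into features (X) and target (y)
X = mydf.drop(columns='classLabel')
y = mydf['classLabel']
columnsNumber = ['numeric1']
XoneColumn = X[columnsNumber]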

I use FunctionTransformer like this:

def extractSpecificColumn(X, columns):
    return X[columns]

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('numeric', FunctionTransformer(columnsNumber)),
            ('scale', StandardScaler())
        ]))
    ], n_jobs=1)),
    ('estimator', RandomForestClassifier(n_estimators=50, criterion='entropy', n_jobs=-1))
])

cv.cross_val_score(pipeline, XoneColumn, y, cv=folds, scoring=kappaScore)

This results in "TypeError: 'list' object is not callable" whenever the FunctionTransformer step is enabled.
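A minimal sketch of one way to avoid this error, assuming a recent scikit-learn in which FunctionTransformer accepts kw_args: pass the function itself as the first argument and the column list as a keyword argument. The small DataFrame is a hypothetical stand-in, and extract_columns mirrors extractSpecificColumn from the question.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def extract_columns(X, columns):
    """Return only the requested columns of a DataFrame."""
    return X[columns]


# Hypothetical data, just large enough to show the selection.
df = pd.DataFrame({'numeric1': [7, 8, 9, 5], 'categorical': [1, 2, 3, 4]})

# The first argument must be a callable; extra arguments go through kw_args.
# validate=False leaves the input untouched, so the function sees a DataFrame.
selector = FunctionTransformer(extract_columns,
                               kw_args={'columns': ['numeric1']},
                               validate=False)
selected = selector.fit_transform(df)
```

Here selected should still be a DataFrame containing only the numeric1 column, which can then be handed to a scaler in the next pipeline step.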

edit:

If I use a ColumnExtractor like the one below instead, no error is raised. But isn't FunctionTransformer meant for exactly this kind of simple case, so that it should just work?

class ColumnExtractor(TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, *_):
        return X[self.columns]

    def fit(self, *_):
        return self
Asked by Georg Heiler on Sep 09 '16 at 07:09

People also ask

What is function transformer in Sklearn?

A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.
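As an illustration of such a stateless transformation, here is a small sketch that wraps np.log1p; the count data is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain NumPy function so it can be used as a pipeline step.
log_transformer = FunctionTransformer(np.log1p)

# Hypothetical frequency counts.
counts = np.array([[0.0, 9.0], [99.0, 999.0]])
logged = log_transformer.fit_transform(counts)
```

Because nothing is learned in fit, the transformer produces the same values as calling np.log1p(counts) directly.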

What's the difference between Pipeline() and make_pipeline() in scikit-learn?

Pipeline requires you to name each step explicitly, and places no rules on the names you choose. make_pipeline names the steps automatically, using a simple rule: the lowercased class name of each estimator.
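The naming difference can be checked directly. This sketch builds the same two steps both ways and collects the step names; the step labels 'scale' and 'estimator' are borrowed from the question's pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit names chosen by the caller.
explicit = Pipeline([
    ('scale', StandardScaler()),
    ('estimator', RandomForestClassifier()),
])

# Names derived automatically from the class names.
automatic = make_pipeline(StandardScaler(), RandomForestClassifier())

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]
# explicit_names  -> ['scale', 'estimator']
# automatic_names -> ['standardscaler', 'randomforestclassifier']
```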

Is ColumnTransformer defined in Scikit-learn a pipeline?

No. ColumnTransformer is a separate class in the scikit-learn Python machine learning library that lets you apply data preparation transforms selectively, to specific columns only. It is typically used as one step inside a Pipeline.
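For the column-selection problem in the question, a ColumnTransformer can do the selecting and scaling in one step. This is a sketch that assumes scikit-learn 0.20 or later (where sklearn.compose is available); the DataFrame is a shortened, hypothetical version of the question's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'categorical': [7, 8, 9, 5],
                   'numeric1': [7.0, 8.0, 9.0, 5.0]})

# Scale only 'numeric1'. Columns that are not listed are dropped,
# because remainder='drop' is the default.
selector = ColumnTransformer([('scale', StandardScaler(), ['numeric1'])])
scaled = selector.fit_transform(df)
```

The result is a NumPy array with a single standardized column, so it can feed straight into an estimator as the next pipeline step.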


1 Answer

FunctionTransformer is used to "lift" a function into a transformer, which I think can help with some data cleaning steps. Imagine you have a mostly numeric array and you want to transform it with a transformer that will error out if it gets a NaN (like Normalizer). You might end up with something like

df.fillna(0, inplace=True)
...
cross_val_score(pipeline, ...)

but maybe that fillna is only required for one transformation, so instead of calling fillna up front as above, you have

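import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

normalize = make_pipeline(
    FunctionTransformer(np.nan_to_num, validate=False),
    Normalizer()
)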

which ends up normalizing it as you want. You can then reuse that snippet in more places without littering your code with .fillna(0).
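A short sketch of that idea in action, using scikit-learn's Normalizer and a made-up two-row array containing one NaN:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

normalize = make_pipeline(
    # Replace NaN with 0 before the step that cannot handle NaN.
    FunctionTransformer(np.nan_to_num, validate=False),
    # Scale each row to unit L2 norm (the default norm).
    Normalizer(),
)

X = np.array([[3.0, 4.0],
              [np.nan, 2.0]])
result = normalize.fit_transform(X)
# result -> [[0.6, 0.8],
#            [0.0, 1.0]]
```

The NaN handling stays local to this small pipeline, so the original array is not modified.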

In your example, you're passing in ['numeric1'], which is a plain list and not a callable. FunctionTransformer expects a function as its first argument, something that extracts the columns the way df[['numeric1']] does. What you may want instead is more like

FunctionTransformer(operator.itemgetter(columns))

but that still won't work by default: with validate=True (the default when this was written), FunctionTransformer converts its input to an np.array before calling the function, so the function never sees a DataFrame.
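The difference between the two input types can be sketched as follows, assuming validate=False so that the DataFrame reaches the function unchanged. The data is hypothetical.

```python
import operator

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'numeric1': [7, 8, 9], 'numeric2': [1, 2, 3]})
columns = ['numeric1']

# itemgetter(columns)(obj) evaluates obj[columns], i.e. df[['numeric1']] here.
getter = operator.itemgetter(columns)
selector = FunctionTransformer(getter, validate=False)
subset = selector.fit_transform(df)

# A plain NumPy array has no column labels, so the same lookup fails.
try:
    getter(df.to_numpy())
    array_lookup_failed = False
except IndexError:
    array_lookup_failed = True
```

With a DataFrame the label lookup succeeds, while the array version raises IndexError, which is the failure mode described above when the input has been converted to an array.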

In order to do operations on particular columns of a DataFrame, you may want to use a library like sklearn-pandas which allows you to define particular transformers by column.

Answered by Alex Riina on Sep 24 '22 at 16:09
