While writing my first pipeline for scikit-learn, I ran into problems when only a subset of columns is fed into the pipeline:
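import pandas as pd

mydf = pd.DataFrame({'classLabel': [0, 0, 0, 1, 1, 0, 0, 0],
                     'categorical': [7, 8, 9, 5, 7, 5, 6, 4],
                     'numeric1': [7, 8, 9, 5, 7, 5, 6, 4],
                     'numeric2': [7, 8, 9, 5, 7, 5, 6, "N.A"]})
columnsNumber = ['numeric1']
# Select the feature column and the target from the frame defined above
XoneColumn = mydf[columnsNumber]
y = mydf['classLabel']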
I use the FunctionTransformer like this:
def extractSpecificColumn(X, columns):
    return X[columns]

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('numeric', FunctionTransformer(columnsNumber)),
            ('scale', StandardScaler())
        ]))
    ], n_jobs=1)),
    ('estimator', RandomForestClassifier(n_estimators=50, criterion='entropy', n_jobs=-1))
])
cv.cross_val_score(pipeline, XoneColumn, y, cv=folds, scoring=kappaScore)
This results in "TypeError: 'list' object is not callable" when the FunctionTransformer step is enabled.
If I instantiate a ColumnExtractor like the one below, no error is raised. But isn't the FunctionTransformer meant for exactly such simple cases, and shouldn't it just work?
class ColumnExtractor(TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, *_):
        return X[self.columns]

    def fit(self, *_):
        return self
A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.
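As a small illustration of the "log of frequencies" case mentioned above, here is a minimal sketch. The array of counts is made up for the example, and np.log1p is just one possible choice of stateless function.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frequency counts, invented for illustration.
counts = np.array([[0.0, 1.0], [9.0, 99.0]])

# Wrap a plain NumPy function so it can be used as a pipeline step.
log_transformer = FunctionTransformer(np.log1p)

# fit() learns nothing here; transform() simply calls np.log1p(X).
logged = log_transformer.fit_transform(counts)
```

Note that the first argument must be a callable; anything else (such as a list of column names) fails as soon as the transformer tries to call it.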
Pipeline requires you to name the steps manually: the names are given explicitly and follow no rule. make_pipeline names the steps automatically, using a straightforward rule (the lowercased class name of each estimator).
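The naming difference can be checked directly. This is only a sketch; the two steps are arbitrary and were picked because they also appear in the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Names chosen by hand.
explicit = Pipeline([
    ('scale', StandardScaler()),
    ('estimator', RandomForestClassifier(n_estimators=50)),
])

# Names derived from the class names.
automatic = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=50))

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]
# explicit_names  -> ['scale', 'estimator']
# automatic_names -> ['standardscaler', 'randomforestclassifier']
```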
The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.
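A minimal sketch of that idea, assuming a scikit-learn version that ships sklearn.compose (0.20 or later). The frame reuses two columns from the question; the choice of StandardScaler and OneHotEncoder is an assumption made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'categorical': [7, 8, 9, 5, 7, 5, 6, 4],
    'numeric1': [7, 8, 9, 5, 7, 5, 6, 4],
})

# Each entry is (name, transformer, columns it applies to).
# sparse_threshold=0 forces a dense array, which is easier to inspect.
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), ['numeric1']),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['categorical']),
], sparse_threshold=0)

# One scaled column followed by one indicator column per distinct category.
features = preprocess.fit_transform(df)
```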
FunctionTransformer is used to "lift" a function to a transformation, which I think can help with some data cleaning steps. Imagine you have a mostly numeric array and you want to transform it with a transformer that will error out if it gets a NaN (like Normalizer). You might end up with something like
df.fillna(0, inplace=True)
...
cross_val_score(pipeline, ...)
but maybe that fillna is only required in one transformation, so instead of calling fillna up front like above, you have
normalize = make_pipeline(
    FunctionTransformer(np.nan_to_num, validate=False),
    Normalizer()
)
which ends up normalizing it as you want. You can then reuse that snippet in more places without littering your code with .fillna(0).
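A runnable version of that idea might look like the following. The input array is invented for the example, and Normalizer (scikit-learn's row-wise normalizer, L2 norm by default) is assumed to be the transformer meant above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

# Hypothetical data with one missing value.
data = np.array([[3.0, 4.0],
                 [np.nan, 2.0]])

normalize = make_pipeline(
    # validate=False lets the NaN through to np.nan_to_num, which replaces it with 0.
    FunctionTransformer(np.nan_to_num, validate=False),
    # Scales every row to unit L2 norm.
    Normalizer(),
)

normalized = normalize.fit_transform(data)
# Rows become [0.6, 0.8] and [0.0, 1.0].
```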
In your example, you're passing in ['numeric1'], which is a list and not an extractor like the similarly typed df[['numeric1']]. What you may want instead is more like
FunctionTransformer(operator.itemgetter(columns))
but that still won't work when input validation is enabled (validate=True), because the object that is ultimately passed to the function will then be an np.array and not a DataFrame.
In order to do operations on particular columns of a DataFrame, you may want to use a library like sklearn-pandas, which allows you to define particular transformers by column.
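If you would rather stay with FunctionTransformer, one possible workaround is to pass an actual function, switch validation off so the DataFrame is not converted to an array, and supply the column list through kw_args. This is a sketch under those assumptions, not the only way to do it; extract_columns is a helper name made up for the example.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def extract_columns(X, columns):
    """Return only the requested columns of a DataFrame."""
    return X[columns]


mydf = pd.DataFrame({
    'categorical': [7, 8, 9, 5, 7, 5, 6, 4],
    'numeric1': [7, 8, 9, 5, 7, 5, 6, 4],
})

select_and_scale = Pipeline([
    # The first argument is the function itself; the columns go in kw_args.
    # validate=False keeps X as a DataFrame, so X[columns] works.
    ('numeric', FunctionTransformer(extract_columns, validate=False,
                                    kw_args={'columns': ['numeric1']})),
    ('scale', StandardScaler()),
])

scaled = select_and_scale.fit_transform(mydf)
```

The result has one standardized column, so the whole frame can be handed to the pipeline and the selection happens inside it.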