I've just discovered the Pipeline feature of scikit-learn, and I find it very useful for testing different combinations of preprocessing steps before training my model.
A pipeline is a chain of objects that implement the fit and transform methods. Until now, whenever I wanted to add a new preprocessing step, I wrote a class that inherits from sklearn.base.BaseEstimator. However, I suspect there must be a simpler way. Do I really need to wrap every function I want to apply in an estimator class?
Example:
import sklearn.base


class Categorizer(sklearn.base.BaseEstimator):
    """
    Converts given columns into pandas dtype 'category'.
    """
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; fit only exists to satisfy the pipeline API.
        return self

    def transform(self, X):
        for column in self.columns:
            X[column] = X[column].astype("category")
        return X
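For context, this is roughly how such a transformer ends up chained with other steps (the step and column names below are just placeholders, not my actual code):

from sklearn.pipeline import Pipeline

# Hypothetical usage: the column names are placeholders.
pipe = Pipeline([
    ("categorize", Categorizer(columns=["color", "size"])),
    # ... further preprocessing or a final model would go here ...
])
# pipe.fit_transform(df) calls Categorizer.fit and then Categorizer.transform.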
For a general solution (one that works for many other use cases, not just transformers but also simple models etc.), you can write your own decorator, provided your functions are state-free (i.e. they do not need fit), for example:
import sklearn.base


class TransformerWrapper(sklearn.base.BaseEstimator):
    def __init__(self, func):
        self._func = func

    def fit(self, *args, **kwargs):
        # State-free: there is nothing to learn.
        return self

    def transform(self, X, *args, **kwargs):
        return self._func(X, *args, **kwargs)
and now you can do

@TransformerWrapper
def foo(x):
    return x*2

which is equivalent to doing

def foo(x):
    return x*2

foo = TransformerWrapper(foo)
which is what sklearn.preprocessing.FunctionTransformer is doing under the hood.
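Once decorated, foo is an estimator instance, so it slots straight into a pipeline. A rough sketch (the step name and the sample input are illustrative only; note that grid-search cloning would need proper get_params support, which this minimal wrapper skips):

import numpy as np
from sklearn.pipeline import Pipeline

# foo is now a TransformerWrapper instance and can be used as a pipeline step.
pipe = Pipeline([("double", foo)])
print(pipe.fit_transform(np.array([1, 2, 3])))  # [2 4 6]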
Personally, I find decorating simpler, since it gives a nice separation between your preprocessors and the rest of the code, but it is up to you which path to follow.
In fact, you should be able to decorate with the sklearn class directly, too:

from sklearn.preprocessing import FunctionTransformer

@FunctionTransformer
def foo(x):
    return x*2
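And if you prefer not to use decorator syntax at all, the plain call form works just as well (np.log1p here is only an example function):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Same idea without decorator syntax: wrap any existing function.
log_transformer = FunctionTransformer(np.log1p)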