sklearn pipeline - Applying sample weights after applying a polynomial feature transformation in a pipeline

I want to apply sample weights while using an sklearn pipeline that first performs a feature transformation (e.g. polynomial) and then fits a regressor (e.g. ExtraTrees).

I am using the following packages in the two examples below:

from sklearn.ensemble import ExtraTreesRegressor
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

Everything works well as long as I separately transform the features and then generate and train the model afterwards:

#Feature generation
X = np.random.rand(200,4)
Y = np.random.rand(200)

#Feature transformation
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # keep the result; fit_transform does not modify X in place

#Model generation and fit
clf = ExtraTreesRegressor(n_estimators=5, max_depth=3)
weights = [1]*100 + [2]*100
clf.fit(X_poly, Y, sample_weight=weights)

But doing the same thing in a pipeline does not work:

#Pipeline generation
pipe = Pipeline([('poly2', PolynomialFeatures(degree=2)), ('ExtraTrees', ExtraTreesRegressor(n_estimators=5, max_depth = 3))])

#Feature generation
X = np.random.rand(200,4)
Y = np.random.rand(200)

#Fitting model
clf = pipe
weights = [1]*100 + [2]*100
clf.fit(X,Y, weights)

I get the following error:

TypeError: fit() takes at most 3 arguments (4 given)

In this simple example, it is no issue to modify the code, but when I want to run several different tests on my real data in my real code, being able to use pipelines and sample weight

stefanE, asked Mar 24 '16 16:03


1 Answer

The documentation for the fit method of Pipeline mentions **fit_params. Because a pipeline has several steps, you must specify which step the parameter should be passed to. You can do this by following the naming rule in the docs:

For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below.

With that in mind, try changing the last line to:

clf.fit(X,Y, **{'ExtraTrees__sample_weight': weights})
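For reference, here is a minimal end-to-end sketch of the question's pipeline with that change applied. It reuses the step names 'poly2' and 'ExtraTrees' from the question; the fixed random seed and the `predictions` variable are additions for illustration only, and the sketch assumes a scikit-learn version where the `step__parameter` form of fit parameters is accepted (in recent releases this is the default behaviour when metadata routing is not enabled).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same shapes as in the question; the seed is an assumption for reproducibility
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
Y = rng.rand(200)
weights = np.array([1] * 100 + [2] * 100)

pipe = Pipeline([
    ("poly2", PolynomialFeatures(degree=2)),
    ("ExtraTrees", ExtraTreesRegressor(n_estimators=5, max_depth=3, random_state=0)),
])

# '<step name>__<parameter name>' routes sample_weight to the regressor's fit only;
# the PolynomialFeatures step never sees the weights
pipe.fit(X, Y, ExtraTrees__sample_weight=weights)

predictions = pipe.predict(X)
```

Passing the weights as a plain keyword argument is equivalent to unpacking the dictionary shown above; the dictionary form is mostly convenient when the step name is built dynamically.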

This is a good example of how to work with parameters in pipelines.
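The same double-underscore naming also applies to constructor parameters of the steps, not only to fit parameters. As a small sketch (the new values 10 and 3 are arbitrary choices for illustration), `set_params` and `get_params` on the pipeline address each step by name:

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = Pipeline([
    ("poly2", PolynomialFeatures(degree=2)),
    ("ExtraTrees", ExtraTreesRegressor(n_estimators=5, max_depth=3)),
])

# Change hyperparameters of individual steps through the pipeline itself
pipe.set_params(ExtraTrees__n_estimators=10, poly2__degree=3)

# Read the values back, either via the nested key or via the step object
n_estimators = pipe.get_params()["ExtraTrees__n_estimators"]
degree = pipe.named_steps["poly2"].degree
```

This is the same key format that a parameter grid uses when a pipeline is tuned with a search object such as GridSearchCV.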

Kevin, answered Nov 07 '22 18:11