All intermediate steps should be transformers and implement fit and transform

I am implementing a pipeline that first selects the important features and then uses those same features to train my random forest classifier. Following is my code.

m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)

model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)

So X_new are the new features selected via SelectFromModel and sel.transform. Then I want to train my RF using the newly selected features.

I am getting the following error:

All intermediate steps should be transformers and implement fit and transform, ExtraTreesClassifier ...

Abdul Karim Khan asked Feb 13 '18 01:02



1 Answer

Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.

X_new, the result of sel.transform(train_cv_x), is an array of already-transformed data, not an estimator, so it doesn't meet this criterion.

In fact, based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation, which is why it's included in the pipeline.
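To make the requirement concrete, here is a rough pure-Python sketch of what fitting a pipeline conceptually involves. It is not scikit-learn's actual implementation; AddOne, MeanPredictor, and fit_chain are hypothetical names made up for illustration.

```python
class AddOne:
    """Toy transformer (hypothetical): implements both fit() and transform()."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class MeanPredictor:
    """Toy final estimator (hypothetical): only needs fit()."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self


def fit_chain(steps, X, y=None):
    """Fit each step in order, feeding every transformer's output to the next."""
    *intermediate, (_, final) = steps
    for name, step in intermediate:
        # This is the check that the question's pipeline fails.
        if not (hasattr(step, "fit") and hasattr(step, "transform")):
            raise TypeError(f"step {name!r} must implement fit and transform")
        X = step.fit(X, y).transform(X)
    final.fit(X, y)
    return final


final = fit_chain([("add", AddOne()), ("est", MeanPredictor())], [1, 2, 3])
print(final.mean_)  # 3.0, the mean of the transformed data [2, 3, 4]
```

Passing a plain array as an intermediate step (as with ('X_new', X_new)) fails the hasattr check in this sketch, which mirrors the error in the question.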

Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method, either. You can verify that in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based on that.
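If you'd rather check from code than from the docstring, a quick hasattr test shows the difference. This is a small sketch that assumes a reasonably recent scikit-learn release; very old releases exposed a deprecated transform() on tree ensembles.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

tree = ExtraTreesClassifier(n_estimators=10)
selector = SelectFromModel(tree)  # wraps the classifier in a transformer

# The bare classifier cannot be an intermediate pipeline step...
print(hasattr(tree, "transform"))
# ...but the SelectFromModel wrapper around it can.
print(hasattr(selector, "transform"))
```

On current releases the first check should come back False and the second True, which is why only the wrapper belongs in the pipeline.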

What type of classes are able to do transformations?

  • Ones that scale your data. See preprocessing and normalization.
  • Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
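As an illustration of both kinds of transformer working as intermediate steps, here is a minimal sketch on synthetic data. The dataset, the step names, and the choice of StandardScaler, PCA, and LogisticRegression are assumptions picked for the example, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),     # scaling transformer
    ("pca", PCA(n_components=3)),    # decomposition transformer
    ("clf", LogisticRegression()),   # final estimator, only needs fit()
])
pipe.fit(X, y)

# Replay the two fitted transformers by hand to see what the classifier received.
X_scaled = pipe.named_steps["scale"].transform(X)
X_reduced = pipe.named_steps["pca"].transform(X_scaled)
print(X_reduced.shape)  # (200, 3)
```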

Without reading between the lines too much about what you're trying to do here, this would work for you:

  1. First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
  2. Build a pipeline that satisfies what your traceback is trying to tell you.
  3. Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.

Roughly, that would look like this:

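from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# X and y are your full feature matrix and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

# SelectFromModel fits the wrapped ExtraTreesClassifier itself and keeps
# the features whose importance is at least the mean importance.
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)

model = Pipeline([('sel', sel), ('clf', clf)])
# 'auto' behaved like 'sqrt' for classifiers; it was deprecated in
# scikit-learn 1.1 and later removed, so it is left out of the grid.
params = {'clf__max_features': ['sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)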
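Once the grid search is fitted, you can look inside the best pipeline to see which features the selector kept. The following is a self-contained sketch under stated assumptions: the data is synthetic (make_classification), the forest is shrunk to 50 trees, and cv=3 is set so that it runs quickly. None of those values come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real X and y.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

model = Pipeline([
    ("sel", SelectFromModel(
        ExtraTreesClassifier(n_estimators=10, random_state=444),
        threshold="mean")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=444)),
])
gs = GridSearchCV(model, {"clf__max_features": ["sqrt", "log2"]}, cv=3)
gs.fit(X_train, y_train)

# Best hyperparameters found by the search.
print(gs.best_params_)

# Boolean mask over the original columns: True where a feature was kept.
mask = gs.best_estimator_.named_steps["sel"].get_support()
print(mask.sum(), "of", mask.size, "features kept")

# Accuracy of the refitted best pipeline on the held-out test set.
test_score = gs.score(X_test, y_test)
print(test_score)
```

Because the selector sits inside the pipeline, it is refitted on each cross-validation training fold, so the feature selection never sees the validation data.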

Two examples for further reading:

  • Pipelining: chaining a PCA and a logistic regression
  • Sample pipeline for text feature extraction and evaluation
Brad Solomon answered Oct 20 '22 07:10