All intermediate steps should be transformers and implement fit and transform

I am building a pipeline that first selects the important features and then trains my random forest classifier on those same features. Here is my code:

m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)

model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)

So X_new holds the new features selected via SelectFromModel and sel.transform. I then want to train my RF on those selected features.

I am getting the following error:

All intermediate steps should be transformers and implement fit and transform, ExtraTreesClassifier ...

asked Feb 13 '18 01:02

Abdul Karim Khan

1 Answer

As the traceback says, each step in your pipeline needs to have a fit() and a transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.
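To see that requirement in action, here is a minimal sketch that reproduces the error on a toy dataset. The data from make_classification and the names X_demo, y_demo and error_message are assumptions made for illustration; they are not from the question. Depending on the scikit-learn version, the check runs either when the pipeline is constructed or when it is fitted, so both calls sit inside the try block.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.pipeline import Pipeline

# Toy data standing in for train_cv_x / train_cv_y (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=50, random_state=0)

error_message = None
try:
    # A classifier as an intermediate step: it has fit() but no transform()
    bad = Pipeline([
        ("m", ExtraTreesClassifier(n_estimators=10)),
        ("clf", RandomForestClassifier(n_estimators=10)),
    ])
    bad.fit(X_demo, y_demo)
except TypeError as exc:
    # The pipeline rejects the intermediate step with a TypeError
    error_message = str(exc)

print(error_message)
```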

X_new, the result of sel.transform(train_cv_x), is an array of transformed data, not an estimator, so it doesn't meet this criterion.

In fact, based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already performs this transformation; that's why it's included in the pipeline.

Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method either. You can verify that in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting from it. Since the classifier is already wrapped inside SelectFromModel, it doesn't need to be a pipeline step of its own.
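A quick way to check which of your objects qualify as intermediate steps is to look for the methods directly. This is only a sketch: is_transformer is a hypothetical helper written for this answer, not part of scikit-learn, and the small array stands in for X_new.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel


def is_transformer(step):
    """Hypothetical helper: True if `step` has what an intermediate step needs."""
    can_fit = hasattr(step, "fit") or hasattr(step, "fit_transform")
    return can_fit and hasattr(step, "transform")


tree = ExtraTreesClassifier(n_estimators=10)
selector = SelectFromModel(tree)
data = np.zeros((4, 3))  # stands in for X_new, which is plain data

tree_ok = is_transformer(tree)          # classifier: fit() but no transform()
selector_ok = is_transformer(selector)  # feature selector: fit() and transform()
data_ok = is_transformer(data)          # an array is not an estimator at all

print(tree_ok, selector_ok, data_ok)
```

Only the SelectFromModel object passes, which is why it is the one thing that belongs before the final classifier.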

What type of classes are able to do transformations?

Ones that scale your data. See preprocessing and normalization.
Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
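As a small illustration of both kinds, the sketch below chains a scaler and a decomposition step. The random data and the names X_demo, prep and X_reduced are assumptions made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random data used purely for illustration
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(20, 5))

# Both steps implement fit() and transform(), so they can be chained
prep = Pipeline([
    ("scale", StandardScaler()),   # scales each feature
    ("pca", PCA(n_components=2)),  # projects onto 2 components
])
X_reduced = prep.fit_transform(X_demo)

print(X_reduced.shape)
```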

Without reading between the lines too much about what you're trying to do here, this would work for you:

First, split X and y using train_test_split. The test set it produces is held out for final testing, and GridSearchCV's cross-validation further splits the train set into smaller train and validation sets.
Build a pipeline that satisfies what your traceback is trying to tell you.
Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.

Roughly, that would look like this:

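from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# X and y are your full feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

# SelectFromModel is a transformer: it fits the wrapped ExtraTreesClassifier
# and keeps the features whose importance is at least the mean importance
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)

model = Pipeline([('sel', sel), ('clf', clf)])
# 'auto' (an alias for 'sqrt' in classifiers) is no longer accepted by
# recent scikit-learn releases, so it is left out of the grid
params = {'clf__max_features': ['sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)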
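If you also want to know which features the selection step kept, you can pull the fitted selector back out of the best pipeline. The sketch below is self-contained and rests on assumptions: it uses synthetic data from make_classification, a much smaller forest (50 trees) so it runs quickly, cv=3, and a grid of only 'sqrt' and 'log2'.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for your data (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

model = Pipeline([
    ("sel", SelectFromModel(
        ExtraTreesClassifier(n_estimators=10, random_state=444),
        threshold="mean")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=444)),
])
gs = GridSearchCV(model, {"clf__max_features": ["sqrt", "log2"]}, cv=3)
gs.fit(X_train, y_train)

# best_estimator_ is the whole pipeline refit on X_train with the best params
best_pipeline = gs.best_estimator_
# Boolean mask over the original columns: True where a feature was kept
mask = best_pipeline.named_steps["sel"].get_support()
n_selected = int(mask.sum())

print(gs.best_params_)
print(n_selected, "of", mask.shape[0], "features kept")
print(gs.score(X_test, y_test))
```

Because the selector sits inside the pipeline, it is refit on each cross-validation training fold, so the validation folds never influence which features are chosen.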

Two examples for further reading:

Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation

answered Oct 20 '22 07:10

Brad Solomon

Tags: python, machine-learning, scikit-learn, feature-selection
