
Cross Validating With Imblearn Pipeline And GridSearchCV

I'm trying to use the Pipeline class from imblearn together with GridSearchCV to find the best parameters for classifying an imbalanced dataset. As per the answers mentioned here, I want to resample only the training set and leave the validation set untouched, which imblearn's Pipeline seems to do. However, I'm getting an error when implementing the accepted solution. Please let me know what I am doing wrong. Below is my implementation:

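from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def imb_pipeline(clf, X, y, params):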

    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

    score={'AUC':'roc_auc', 
           'RECALL':'recall',
           'PRECISION':'precision',
           'F1':'f1'}

    gcv = GridSearchCV(estimator=model, param_grid=params, cv=5, scoring=score, n_jobs=12, refit='F1',
                       return_train_score=True)
    gcv.fit(X, y)

    return gcv

for param, classifier in zip(params, classifiers):
    print("Working on {}...".format(classifier[0]))
    clf = imb_pipeline(classifier[1], X_scaled, y, param) 
    print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
    print("Best `F1` for {} is {}".format(classifier[0], clf.best_score_))
    print('-'*50)
    print('\n')

params:

[{'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
 {'n_neighbors': (10, 15, 25)},
 {'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]

classifiers:

[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False)),
 ('KNearestNeighbors',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                       weights='uniform')),
 ('Gradient Boosting Classifier',
  GradientBoostingClassifier(criterion='friedman_mse', init=None,
                             learning_rate=0.1, loss='deviance', max_depth=3,
                             max_features=None, max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=2,
                             min_weight_fraction_leaf=0.0, n_estimators=100,
                             n_iter_no_change=None, presort='auto',
                             random_state=None, subsample=1.0, tol=0.0001,
                             validation_fraction=0.1, verbose=0,
                             warm_start=False))]

Error:

ValueError: Invalid parameter C for estimator Pipeline(memory=None,
         steps=[('sampling',
                 SMOTE(k_neighbors=5, kind='deprecated',
                       m_neighbors='deprecated', n_jobs=1,
                       out_step='deprecated', random_state=None, ratio=None,
                       sampling_strategy='auto', svm_estimator='deprecated')),
                ('classification',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`. """
Krishnang K Dalal asked Nov 12 '19 08:11




1 Answer

Please see this example of how to pass parameters to a Pipeline: https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py

When you use a pipeline, each parameter name has to tell the pipeline which step in the list the parameter belongs to. For that, it uses the step names you provided when initialising the Pipeline.

In your code, for example:

model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

To pass the parameter p1 to SMOTE you would use sampling__p1 as a parameter, not p1.

You used "classification" as the name for your clf, so prefix every parameter that is meant for the clf with that name.
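To see which names a pipeline accepts, you can inspect get_params(), as the error message suggests. The following is only a minimal sketch: it uses scikit-learn's own Pipeline, and StandardScaler stands in for SMOTE (an assumption made so that the snippet needs scikit-learn alone). The naming rule should be the same for imblearn's Pipeline.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler is a stand-in for SMOTE here (assumption), so that the
# sketch runs with scikit-learn only. The step names match the question.
model = Pipeline([
    ("sampling", StandardScaler()),
    ("classification", LogisticRegression()),
])

# Every name the pipeline accepts, including nested step parameters.
valid_names = set(model.get_params().keys())
print("C" in valid_names)
print("classification__C" in valid_names)

# GridSearchCV applies each candidate through set_params, so a bare
# name such as C is rejected with a ValueError.
try:
    model.set_params(C=0.1)
    bare_name_rejected = False
except ValueError:
    bare_name_rejected = True

# The prefixed name is routed to the 'classification' step.
model.set_params(classification__C=0.1)
```

Any key in your param_grid that is not in valid_names will trigger the "Invalid parameter" error from the question.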

Try:

[{'classification__penalty': ('l1', 'l2'), 'classification__C': (0.01, 0.1, 1.0, 10)},
 {'classification__n_neighbors': (10, 15, 25)},
 {'classification__n_estimators': (80, 100, 150, 200), 'classification__min_samples_split': (5, 7, 10, 20)}]

Make sure there are two underscores between the name and the parameter.
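Since every key needs the same prefix, you could also convert the grids from the question programmatically instead of editing them by hand. This is just a sketch; prefix_param_grid is a hypothetical helper name, not part of scikit-learn or imblearn.

```python
def prefix_param_grid(step_name, grid):
    """Return a copy of `grid` with every key addressed to `step_name`.

    Hypothetical helper (not a library function): it joins the step name
    and the parameter name with the double underscore a Pipeline expects.
    """
    return {f"{step_name}__{name}": values for name, values in grid.items()}


# The unprefixed grids from the question.
params = [
    {"penalty": ("l1", "l2"), "C": (0.01, 0.1, 1.0, 10)},
    {"n_neighbors": (10, 15, 25)},
    {"n_estimators": (80, 100, 150, 200), "min_samples_split": (5, 7, 10, 20)},
]

# One prefixed grid per classifier, in the same order as before.
pipeline_params = [prefix_param_grid("classification", grid) for grid in params]

for grid in pipeline_params:
    print(sorted(grid))
```

You would then loop over pipeline_params instead of params when calling imb_pipeline.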

Vivek Kumar answered Sep 28 '22 15:09
