
Does imblearn pipeline turn off sampling for testing?

Let us suppose the following code (from imblearn example on pipelines)

...    
# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformer and two samplers to the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)

pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)

I want to make sure that when pipeline.predict(X_test) is executed, the sampling procedures enn and renn are not executed (but of course the pca must be).

  1. First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

  2. I browsed through the imblearn Pipeline code but I could not find the predict method there.

  3. I would also like to be sure that this correct behavior works when the pipeline is inside a GridSearchCV.

I just need some assurance that this is what happens with the imblearn.Pipeline.

EDIT: 2020-08-28

@wundermahn's answer is all I needed.

This edit is just to add that this is the reason one should use imblearn.Pipeline for imbalanced pre-processing and not sklearn.Pipeline. Nowhere in the imblearn documentation did I find an explanation of why imblearn.Pipeline is needed when there is sklearn.Pipeline.

Jacques Wainer asked Aug 21 '20 10:08


1 Answer

Great question(s). To go through them in the order you posted:

  1. First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

That is correct. You certainly do not want to test (whether on your test or validation data) on data that is not representative of the actual, live, "production" dataset, so resampling should only be applied to the training data. Please note that if you are using techniques like k-fold cross-validation, you should apply the sampling within each fold individually, to that fold's training portion only, as indicated by this IEEE paper.
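
To make "apply the sampling to each fold individually" concrete, here is a minimal sketch. It assumes synthetic data from make_classification and uses plain random oversampling via sklearn.utils.resample as a stand-in for an imblearn sampler; the helper oversample_minority is a hypothetical name made up for this illustration. Only the training part of each split is resampled, and the held-out part is scored untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 90% / 10%), for illustration only.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)


def oversample_minority(X_part, y_part, seed):
    """Hypothetical helper: duplicate random minority rows until classes are balanced."""
    classes, counts = np.unique(y_part, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    X_min = X_part[y_part == minority]
    X_extra = resample(X_min, replace=True, n_samples=n_extra, random_state=seed)
    X_out = np.vstack([X_part, X_extra])
    y_out = np.concatenate([y_part, np.full(n_extra, minority)])
    return X_out, y_out


fold_scores = []
test_sizes = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # Resample the training part of this split only.
    X_res, y_res = oversample_minority(X[train_idx], y[train_idx], seed=fold)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_res, y_res)
    # Score on the untouched held-out part, which keeps its original imbalance.
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    test_sizes.append(len(test_idx))

print(fold_scores)
```

Passing an imblearn pipeline to scikit-learn's cross-validation helpers gives the same per-fold behaviour without the manual loop, because the pipeline is refit on the training part of every split.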

  2. I browsed through the imblearn Pipeline code but I could not find the predict method there.

I'm assuming you found the imblearn.pipeline source code. If so, take a look at the fit_predict method:

    @if_delegate_has_method(delegate="_final_estimator")
    def fit_predict(self, X, y=None, **fit_params):
        """Apply `fit_predict` of last step in pipeline after transforms.
        Applies fit_transforms of a pipeline to the data, followed by the
        fit_predict method of the final estimator in the pipeline. Valid
        only if the final estimator implements fit_predict.
        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first step of
            the pipeline.
        y : iterable, default=None
            Training targets. Must fulfill label requirements for all steps
            of the pipeline.
        **fit_params : dict of string -> object
            Parameters passed to the ``fit`` method of each step, where
            each parameter name is prefixed such that parameter ``p`` for step
            ``s`` has key ``s__p``.
        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            The predicted target.
        """
        Xt, yt, fit_params = self._fit(X, y, **fit_params)
        with _print_elapsed_time('Pipeline',
                                 self._log_message(len(self.steps) - 1)):
            y_pred = self.steps[-1][-1].fit_predict(Xt, yt, **fit_params)
        return y_pred

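Here, fit_predict first calls self._fit, which is where the transformers and samplers are run on the training data, and then calls fit_predict on the final estimator. A plain pipeline.predict(X_test) call does not resample: the samplers are only applied during fit, so the pipeline applies the fitted transformers (the PCA in your example) and then utilizes the .predict method of the final estimator, in the example you posted, scikit-learn's knn: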

    def predict(self, X):
        """Predict the class labels for the provided data.
        Parameters
        ----------
        X : array-like of shape (n_queries, n_features), \
                or (n_queries, n_indexed) if metric == 'precomputed'
            Test samples.
        Returns
        -------
        y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
            Class labels for each data sample.
        """
        X = check_array(X, accept_sparse='csr')

        neigh_dist, neigh_ind = self.kneighbors(X)
        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]

        n_outputs = len(classes_)
        n_queries = _num_samples(X)
        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            if weights is None:
                mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
            else:
                mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)

            mode = np.asarray(mode.ravel(), dtype=np.intp)
            y_pred[:, k] = classes_k.take(mode)

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred
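
The difference between fitting and predicting can be sketched with a toy pipeline. This is a simplified illustration of the idea, not imblearn's actual source: ToyPipeline, Scale, KeepFirstHalf and Threshold are hypothetical classes invented here, and they work on plain Python lists so the sketch has no dependencies.

```python
class ToyPipeline:
    """Simplified illustration only; not imblearn's implementation."""

    def __init__(self, steps):
        self.steps = steps  # every step except the last is a transformer or a sampler

    def fit(self, X, y):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                # Samplers may change the number of rows, so X and y both change.
                X, y = step.fit_resample(X, y)
            else:
                X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                continue  # samplers are skipped at prediction time
            X = step.transform(X)
        return self.steps[-1].predict(X)


class Scale:
    """Toy transformer: doubles every value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * 2 for x in X]


class KeepFirstHalf:
    """Toy sampler: keeps the first half of the rows and counts its calls."""

    def __init__(self):
        self.calls = 0

    def fit_resample(self, X, y):
        self.calls += 1
        half = len(X) // 2
        return X[:half], y[:half]


class Threshold:
    """Toy classifier: remembers how many rows it was fitted on."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        return self

    def predict(self, X):
        return [int(x > 5) for x in X]


sampler = KeepFirstHalf()
clf = Threshold()
pipe = ToyPipeline([Scale(), sampler, clf]).fit([1, 2, 3, 4], [0, 0, 1, 1])
preds = pipe.predict([1, 2, 3, 4])
print(clf.n_seen_, sampler.calls, preds)
```

The classifier is fitted on the resampled rows only, while predict returns one label per input row and never calls the sampler again.
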

  3. I would also like to be sure that this correct behavior works when the pipeline is inside a GridSearchCV.

This builds on the two points above. GridSearchCV fits the pipeline on the training part of each split and scores it on the held-out part, so the sampler only ever touches the training part. I am taking this to mean you want a complete, minimal, reproducible example of this working in a GridSearchCV. There is extensive documentation from scikit-learn on this, but an example I created using knn is below:

import numpy as np

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

param_grid = [
    {
        'classification__n_neighbors': [1,3,5,7,10],
    }
]

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)

pipe = Pipeline([
    ('sampling', SMOTE()),
    ('classification', KNeighborsClassifier())
])

grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
mean_scores = np.array(grid.cv_results_['mean_test_score'])
print(mean_scores)

# Example output (no random_state is set, so values vary between runs):
# [0.98051926 0.98121129 0.97981998 0.98050474 0.97494193]

Your intuition was spot on, good job :)

artemis answered Nov 14 '22 22:11