Use a metric after a classifier in a Pipeline

I am still investigating pipelines. My aim is to execute every step of the machine learning workflow inside a single pipeline, which makes it more flexible and easier to adapt to another use case. So here is what I do:

  • Step 1: Fill NaN Values
  • Step 2: Transforming Categorical Values into Numbers
  • Step 3: Classifier
  • Step 4: GridSearch
  • Step 5: Add a metric (failed)

Here is my code:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score


class FillNa(BaseEstimator, TransformerMixin):
    """Fill NaN values: most frequent value for non-numeric columns,
    mean for numeric ones."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        for column in x.columns:
            if column in non_numerics_columns:
                # Use the mode of the column itself (not the global df)
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].value_counts().idxmax())
            else:
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].mean())
        return x

    def fit(self, x, y=None):
        return self


class CategoricalToNumerical(BaseEstimator, TransformerMixin):
    """Label-encode every non-numeric column into integers."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            # Fill any remaining NaN first so LabelEncoder does not fail
            x.loc[:, column] = x.loc[:, column].fillna(
                x.loc[:, column].value_counts().idxmax())
            le.fit(x.loc[:, column])
            x.loc[:, column] = le.transform(x.loc[:, column]).astype(int)
        return x

    def fit(self, x, y=None):
        return self


class Perf(BaseEstimator, TransformerMixin):

    def fit(self, clf, x, y, perf="all"):
        """Only for classifier model.

        Return AUC, ROC, Confusion Matrix and F1 score from a classifier and df
        You can put a list of eval instead a string for eval paramater.
        Example: eval=['all', 'auc', 'roc', 'cm', 'f1'] will return these 4
        evals.
        """
        evals = {}
        y_pred_proba = clf.predict_proba(x)[:, 1]
        y_pred = clf.predict(x)
        perf_list = perf.split(',')
        if ("all" or "roc") in perf.split(','):
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            roc_auc = round(auc(fpr, tpr), 3)
            plt.style.use('bmh')
            plt.figure(figsize=(12, 9))
            plt.title('ROC Curve')
            plt.plot(fpr, tpr, 'b',
                     label='AUC = {}'.format(roc_auc))
            plt.legend(loc='lower right', borderpad=1, labelspacing=1,
                       prop={"size": 12}, facecolor='white')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.xlim([-0.1, 1.])
            plt.ylim([-0.1, 1.])
            plt.ylabel('True Positive Rate')
            plt.xlabel('False Positive Rate')
            plt.show()

        if "all" in perf_list or "auc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            evals['auc'] = auc(fpr, tpr)

        if "all" in perf_list or "cm" in perf_list:
            evals['cm'] = confusion_matrix(y, y_pred)

        if "all" in perf_list or "f1" in perf_list:
            evals['f1'] = f1_score(y, y_pred)

        return evals


path = '~/proj/akd-doc/notebooks/data/'
df = pd.read_csv(path + 'titanic_tuto.csv', sep=';')
y = df.pop('Survival-Status').replace(to_replace=['dead', 'alive'],
                                      value=[0., 1.])
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42)

percent = 0.50
nb_features = round(percent * df.shape[1]) + 1
clf = RandomForestClassifier()
pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf),
                     ('perf', Perf())])

params = dict(random_forest__max_depth=list(range(8, 12)),
              random_forest__n_estimators=list(range(30, 110, 10)))
cv = GridSearchCV(pipeline, param_grid=params)
cv.fit(X_train, y_train)

I am aware that it is not ideal to plot a ROC curve from inside a pipeline step, but that's not the problem right now.

So, when I execute this code I have:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('fillna', FillNa()), ('categorical_to_numerical', CategoricalToNumerical()), ('features_selection', SelectKBest(k=10, score_func=<function f_classif at 0x7f4ed4c3eae8>)), ('random_forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None,...=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)), ('perf', Perf())]) does not.

I'm interested in all ideas...

asked May 04 '17 by Jeremie Guez



1 Answer

As the error states, you need to specify the scoring parameter in GridSearchCV.

Use

GridSearchCV(pipeline, param_grid=params, scoring='accuracy')
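
If accuracy is not the metric you care about, scoring also accepts any scorer built with make_scorer. A minimal sketch using the F1 score instead (assuming the binary target from this Titanic example):

from sklearn.metrics import f1_score, make_scorer

# Evaluate each grid-search candidate with F1 instead of accuracy
f1_scorer = make_scorer(f1_score)
cv = GridSearchCV(pipeline, param_grid=params, scoring=f1_scorer)
cv.fit(X_train, y_train)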

Edit (Based on questions in comments):

If you need the ROC curve, AUC and F1 score for the entire X_train and y_train (and not for each split of GridSearchCV), it's better to keep the Perf class out of the pipeline.

pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf)])

# Fit the whole pipeline on the training data
pipeline.fit(X_train, y_train)

performance_meas = Perf()
performance_meas.fit(pipeline, X_train, y_train)
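
Since Perf.fit returns the evals dictionary, you can capture it and request only the metrics you need through the comma-separated perf argument defined in the question's class:

# Request only AUC and F1; the keys of the returned dict match the perf names
evals = performance_meas.fit(pipeline, X_train, y_train, perf="auc,f1")
print(evals['auc'], evals['f1'])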
answered Sep 27 '22 by Vivek Kumar