Use a metric after a classifier in a Pipeline

I am still investigating pipelines. My aim is to execute every step of the machine learning workflow inside a single pipeline, which makes it more flexible and easier to adapt to another use case. So here is what I do:

  • Step 1: Fill NaN Values
  • Step 2: Transforming Categorical Values into Numbers
  • Step 3: Classifier
  • Step 4: GridSearch
  • Step 5: Add a metric (failed)

Here is my code:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score


class FillNa(BaseEstimator, TransformerMixin):
    """Fill NaN values: most frequent value for non-numeric columns,
    mean for numeric ones."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        for column in x.columns:
            if column in non_numerics_columns:
                # Use the mode of the column itself (not the global df)
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].value_counts().idxmax())
            else:
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].mean())
        return x

    def fit(self, x, y=None):
        return self


class CategoricalToNumerical(BaseEstimator, TransformerMixin):
    """Label-encode every non-numeric column into integers."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            # Fill any remaining NaN first so LabelEncoder does not fail
            x.loc[:, column] = x.loc[:, column].fillna(
                x.loc[:, column].value_counts().idxmax())
            le.fit(x.loc[:, column])
            x.loc[:, column] = le.transform(x.loc[:, column]).astype(int)
        return x

    def fit(self, x, y=None):
        return self


class Perf(BaseEstimator, TransformerMixin):

    def fit(self, clf, x, y, perf="all"):
        """Only for classifier model.

        Return AUC, ROC, Confusion Matrix and F1 score from a classifier and df
        You can put a list of eval instead a string for eval paramater.
        Example: eval=['all', 'auc', 'roc', 'cm', 'f1'] will return these 4
        evals.
        """
        evals = {}
        y_pred_proba = clf.predict_proba(x)[:, 1]
        y_pred = clf.predict(x)
        perf_list = perf.split(',')
        if ("all" or "roc") in perf.split(','):
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            roc_auc = round(auc(fpr, tpr), 3)
            plt.style.use('bmh')
            plt.figure(figsize=(12, 9))
            plt.title('ROC Curve')
            plt.plot(fpr, tpr, 'b',
                     label='AUC = {}'.format(roc_auc))
            plt.legend(loc='lower right', borderpad=1, labelspacing=1,
                       prop={"size": 12}, facecolor='white')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.xlim([-0.1, 1.])
            plt.ylim([-0.1, 1.])
            plt.ylabel('True Positive Rate')
            plt.xlabel('False Positive Rate')
            plt.show()

        if "all" in perf_list or "auc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            evals['auc'] = auc(fpr, tpr)

        if "all" in perf_list or "cm" in perf_list:
            evals['cm'] = confusion_matrix(y, y_pred)

        if "all" in perf_list or "f1" in perf_list:
            evals['f1'] = f1_score(y, y_pred)

        return evals


path = '~/proj/akd-doc/notebooks/data/'
df = pd.read_csv(path + 'titanic_tuto.csv', sep=';')
y = df.pop('Survival-Status').replace(to_replace=['dead', 'alive'],
                                      value=[0., 1.])
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42)

percent = 0.50
nb_features = round(percent * df.shape[1]) + 1
clf = RandomForestClassifier()
pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf),
                     ('perf', Perf())])

params = dict(random_forest__max_depth=list(range(8, 12)),
              random_forest__n_estimators=list(range(30, 110, 10)))
cv = GridSearchCV(pipeline, param_grid=params)
cv.fit(X_train, y_train)

I am aware that it is not ideal to plot a ROC curve from inside a pipeline step, but that's not the problem right now.

So, when I execute this code I have:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('fillna', FillNa()), ('categorical_to_numerical', CategoricalToNumerical()), ('features_selection', SelectKBest(k=10, score_func=<function f_classif at 0x7f4ed4c3eae8>)), ('random_forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None,...=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)), ('perf', Perf())]) does not.

I'm interested in all ideas...

asked May 04 '17 by Jeremie Guez



1 Answer

As the error states, you need to specify the scoring parameter in GridSearchCV.

Use

GridSearchCV(pipeline, param_grid=params, scoring='accuracy')
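
If accuracy is not the metric you care about, scoring also accepts any scorer built with make_scorer. A minimal sketch using the F1 score instead (assuming the binary target from this Titanic example):

from sklearn.metrics import f1_score, make_scorer

# Evaluate each grid-search candidate with F1 instead of accuracy
f1_scorer = make_scorer(f1_score)
cv = GridSearchCV(pipeline, param_grid=params, scoring=f1_scorer)
cv.fit(X_train, y_train)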

Edit (Based on questions in comments):

If you need the ROC curve, AUC and F1 score for the entire X_train and y_train (and not for each split of GridSearchCV), it's better to keep the Perf class out of the pipeline.

pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf)])

# Fit the whole pipeline on the training data
pipeline.fit(X_train, y_train)

performance_meas = Perf()
performance_meas.fit(pipeline, X_train, y_train)
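
Since Perf.fit returns the evals dictionary, you can capture it and request only the metrics you need through the comma-separated perf argument defined in the question's class:

# Request only AUC and F1; the keys of the returned dict match the perf names
evals = performance_meas.fit(pipeline, X_train, y_train, perf="auc,f1")
print(evals['auc'], evals['f1'])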
answered Sep 27 '22 by Vivek Kumar