
Scikit-learn: What's the easiest way to get the confusion matrix of an estimator when using GridSearchCV?

In this simplified example, I've trained a learner with GridSearchCV. I would like to return the confusion matrix of the best learner when predicting on the full set X.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr_pipeline = Pipeline([('clf', LogisticRegression())])
lr_parameters = {}

lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
lr_gs = lr_gs.fit(X, y)

print(lr_gs.confusion_matrix)  # Would like to be able to do this

Thanks

— Zachary Nagler

2 Answers

I found this question while searching for how to calculate the confusion matrix while fitting scikit-learn's GridSearchCV. I was able to find a solution by defining a custom scoring function, although it's somewhat kludgy. I'm leaving this answer for anyone else who makes a similar search.

As mentioned by @MLgeek and @bugo99iot, the accepted answer by @Sudeep Juvekar isn't really satisfactory. It offers a literal answer to the original question as asked, but it's not usually the case that a machine learning practitioner is interested in the confusion matrix of a fitted model on its training data. It is more typically of interest to know how well a model generalizes to data it hasn't seen.

To use a custom scoring function in GridSearchCV you will need to import the Scikit-learn helper function make_scorer.

from sklearn.metrics import make_scorer

The custom scoring function looks like this

def _count_score(y_true, y_pred, label1=0, label2=1):
    # Count examples whose true label is label1 and whose predicted label is label2
    return sum((y == label1 and pred == label2)
               for y, pred in zip(y_true, y_pred))

For a given pair of labels, (label1, label2), it calculates the number of examples where the true value of y is label1 and the predicted value of y is label2.
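For instance, on a small hand-made pair of label lists (purely illustrative toy data), counting how often a true 0 was predicted as 1:

y_true_toy = [0, 1, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 1]

print(_count_score(y_true_toy, y_pred_toy, label1=0, label2=1))  # prints 2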

To start, find all of the labels in the training data

all_labels = sorted(set(y))

The optional argument scoring of GridSearchCV can receive a dictionary mapping strings to scorers. make_scorer can take a scoring function along with bindings for some of its parameters and produce a scorer, which is a particular type of callable that is used for scoring in GridSearchCV, cross_val_score, etc. Let's build up this dictionary for each pair of labels.

scorer = {}
for label1 in all_labels:
    for label2 in all_labels:
        count_score = make_scorer(_count_score, label1=label1,
                                  label2=label2)
        scorer['count_%s_%s' % (label1, label2)] = count_score
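As a quick sanity check (a purely illustrative case, assuming binary labels 0 and 1), the loop above produces one scorer per ordered pair of labels:

print(sorted(scorer))
# ['count_0_0', 'count_0_1', 'count_1_0', 'count_1_1']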

You'll also want to add any additional scoring functions you're interested in. To avoid getting into the subtleties of scoring for multi-class classification let's add a simple accuracy score.

# import placed here for the sake of demonstration.
# Should be imported alongside make_scorer above
from sklearn.metrics import accuracy_score

scorer['accuracy'] = make_scorer(accuracy_score)

We can now fit GridSearchCV

num_splits = 5
lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1,
                     scoring=scorer, refit='accuracy',
                     cv=num_splits)
lr_gs = lr_gs.fit(X, y)

refit='accuracy' tells GridSearchCV to judge by the best accuracy score when deciding which parameters to use for refitting. When you pass a dictionary of multiple scorers to scoring and do not pass a value to the optional argument refit, GridSearchCV will not refit the model on the full training data. We've explicitly set the number of splits because we'll need to know it later.

Now, for each of the folds used in cross-validation, what we've essentially done is calculate the confusion matrix on the respective test fold. The test folds do not overlap and together cover the entire dataset, so we've made a prediction for every data point in X in such a way that the prediction for each point does not depend on the associated target label for that point.

We can add up the confusion matrices associated with the test folds to get something useful that tells us how well the model generalizes. It can also be interesting to look at the confusion matrices for the test folds separately and do things like calculate variances (a sketch of this appears further below, after the key format is described).

We're not done yet, though. We need to actually pull out the confusion matrix for the best estimator. In this example, the cross-validation results are stored in the dictionary lr_gs.cv_results_. First, let's get the index in the results corresponding to the best set of parameters

# index of the parameter setting ranked first on test accuracy
best_index = list(lr_gs.cv_results_['rank_test_accuracy']).index(1)

If you are using a different metric to decide upon the best parameters, substitute for 'accuracy' the key you are using for the associated scorer in the scoring dictionary passed to GridSearchCV.
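For example, if you had registered an extra scorer under a key 'f1' (a hypothetical addition, not part of the scorer dictionary built above), the corresponding rank column in cv_results_ would be 'rank_test_f1':

# 'f1' is a hypothetical extra scorer key, shown only to illustrate the naming pattern
best_index = list(lr_gs.cv_results_['rank_test_f1']).index(1)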

In my own application I chose to store the confusion matrix as a nested dictionary.

confusion = defaultdict(lambda: defaultdict(int))
for label1 in all_labels:
    for label2 in all_labels:
        for i in range(num_splits):
            key = 'split%s_test_count_%s_%s' % (i, label1, label2)
            val = int(lr_gs.cv_results_[key][best_index])
            confusion[label1][label2] += val
confusion = {key: dict(value) for key, value in confusion.items()}

There's some stuff to unpack here. defaultdict(lambda: defaultdict(int)) constructs a nested defaultdict; a defaultdict of defaultdict of int (if you're copying and pasting, don't forget to add from collections import defaultdict at the top of your file). The last line of this snippet is used to turn confusion into a regular dict of dict of int. Never leave defaultdicts lying around when they are no longer needed.
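If you'd rather end up with the familiar array form, here's a minimal sketch (assuming the confusion dict and all_labels built above, with numpy available) that lays it out with rows as true labels and columns as predicted labels:

import numpy as np

confusion_array = np.array([[confusion[t][p] for p in all_labels]
                            for t in all_labels])
print(confusion_array)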

You will likely want to store your confusion matrix in a different way. The key fact is that the confusion matrix entry for the pair of labels 'label1', 'label2' for test fold i is stored in

lr_gs.cv_results_['split%s_test_count_%s_%s' % (i, label1, label2)][best_index]
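Building on that key format, here's a rough sketch (assuming the scorer keys, num_splits, all_labels, and best_index defined above) of pulling out each test fold's confusion matrix separately, e.g. to look at the spread across folds mentioned earlier:

import numpy as np

per_fold = []
for i in range(num_splits):
    fold_matrix = [[int(lr_gs.cv_results_['split%s_test_count_%s_%s'
                                          % (i, label1, label2)][best_index])
                    for label2 in all_labels]
                   for label1 in all_labels]
    per_fold.append(fold_matrix)

fold_variance = np.var(per_fold, axis=0)  # entry-wise variance across folds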

See here for an example of this confusion matrix calculation used in practice. I think it's a bit of a code smell to rely on the specific format of the keys in the cv_results_ dictionary, but this does work, at least as of the day of this post.

— Albert Steppi


You will first need to predict using the best estimator in your GridSearchCV. A common method to use is GridSearchCV.decision_function(), but for your example decision_function returns continuous decision scores from LogisticRegression rather than predicted labels, so it does not work with confusion_matrix. Instead, take the best estimator from lr_gs and predict the labels using that estimator.

y_pred = lr_gs.best_estimator_.predict(X)

Finally, use sklearn's confusion_matrix on the real and predicted y

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))

— Sudeep Juvekar