
How to graph whether overfitting takes place in a multiclass classifier

I want to monitor the loss during the training of a multiclass Gradient Boosting Classifier as a way to know whether there is overfitting taking place or not. Here is my code:

%matplotlib inline
import numpy as np
#import matplotlib.pyplot as plt
import matplotlib.pylab as plt
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

n_est = 100
clf = GradientBoostingClassifier(n_estimators=n_est, max_depth=3, random_state=2)
clf.fit(X_train, y_train)


test_score = np.empty(len(clf.estimators_))
for i, pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, pred)
plt.plot(np.arange(n_est) + 1, test_score, label='Test')
plt.plot(np.arange(n_est) + 1, clf.train_score_, label='Train')
plt.show()

However, I get the following ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-27194f883893> in <module>()
     22 test_score = np.empty(len(clf.estimators_))
     23 for i, pred in enumerate(clf.staged_predict(X_test)):
---> 24     test_score[i] = clf.loss_(y_test, pred)
     25 plt.plot(np.arange(n_est) + 1, test_score, label='Test')
     26 plt.plot(np.arange(n_est) + 1, clf.train_score_, label='Train')

C:\Documents and Settings\Philippe\Anaconda\lib\site-packages\sklearn\ensemble\gradient_boosting.pyc in __call__(self, y, pred)
    396             Y[:, k] = y == k
    397 
--> 398         return np.sum(-1 * (Y * pred).sum(axis=1) +
    399                       logsumexp(pred, axis=1))
    400 

ValueError: operands could not be broadcast together with shapes (45,3) (45) 

I know this code works fine if I use the GradientBoostingRegressor, but I can't figure out how to make it work with a multiclass classifier such as the GradientBoostingClassifier. Thanks for your help.

user3329302 asked May 06 '14 15:05



1 Answer

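It seems that loss_ expects an array of shape [n_samples, k], where k is the number of classes, whereas staged_predict returns an array of shape [n_samples] containing predicted labels (as per the documentation). That mismatch is what produces the broadcast error between (45,3) and (45). You probably want to pass the result of staged_predict_proba or staged_decision_function into loss_ instead.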
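To see the mismatch concretely, here is a small sketch that prints the shape of the first staged output from each method. It assumes a scikit-learn version that provides sklearn.model_selection, and the names labels, scores and probs are mine, not from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Same data and split as in the question: 150 iris samples, 30% held out.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# A small ensemble is enough to inspect the output shapes.
clf = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=2)
clf.fit(X_train, y_train)

# Each staged_* method is a generator; take the output after the first stage.
labels = next(clf.staged_predict(X_test))            # hard class labels
scores = next(clf.staged_decision_function(X_test))  # raw per-class scores
probs = next(clf.staged_predict_proba(X_test))       # per-class probabilities

print(labels.shape)  # (45,)
print(scores.shape)  # (45, 3)
print(probs.shape)   # (45, 3)
```

Only the last two have one column per class, so only they can be compared against the one-hot encoded targets that the multiclass loss builds internally.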

I think you can measure the loss on both the train and test sets like so:

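# Allocate one slot per boosting stage for each curve.
test_score = np.empty(len(clf.estimators_))
train_score = np.empty(len(clf.estimators_))

# staged_decision_function yields arrays of shape [n_samples, k], as loss_ expects.
for i, pred in enumerate(clf.staged_decision_function(X_test)):
    test_score[i] = clf.loss_(y_test, pred)

for i, pred in enumerate(clf.staged_decision_function(X_train)):
    train_score[i] = clf.loss_(y_train, pred)

plt.plot(test_score)
plt.plot(train_score)
plt.legend(['test score', 'train score'])
plt.show()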

Note that the second call to loss_ receives the train set, so the train curve is computed the same way as the test curve. The output looks like what I would expect:

[image: plot of the test score and train score curves]
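As far as I know, the loss_ attribute is not available in recent scikit-learn releases, so here is a hedged alternative that gets a comparable curve from public functions only. It is a sketch, not the method used above: it swaps in sklearn.metrics.log_loss on the output of staged_predict_proba, which gives the mean log loss per sample and may differ in scale from what loss_ returned. The helpers staged_log_loss and plot_loss_curves are names I made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def staged_log_loss(clf, X, y):
    """Return the mean log loss on (X, y) after each boosting stage."""
    return np.array([
        log_loss(y, proba, labels=clf.classes_)
        for proba in clf.staged_predict_proba(X)
    ])


def plot_loss_curves(train_loss, test_loss):
    """Plot both curves against the boosting iteration (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    stages = np.arange(1, len(train_loss) + 1)
    plt.plot(stages, train_loss, label="Train")
    plt.plot(stages, test_loss, label="Test")
    plt.xlabel("Boosting iteration")
    plt.ylabel("Log loss")
    plt.legend()
    plt.show()


# Same data, split and model settings as in the question.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=2)
clf.fit(X_train, y_train)

train_loss = staged_log_loss(clf, X_train, y_train)
test_loss = staged_log_loss(clf, X_test, y_test)

# Stage (1-based) with the lowest held-out loss; if the test curve rises
# after this point while the train curve keeps falling, that suggests overfitting.
best_stage = int(np.argmin(test_loss)) + 1
print("lowest test loss at stage", best_stage)
```

Calling plot_loss_curves(train_loss, test_loss) draws the graph. A test curve that turns upward while the train curve keeps decreasing is the usual sign of overfitting.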

mbatchkarov answered Sep 21 '22 08:09