How to graph whether overfitting takes place in a multiclass classifier

Tags: scikit-learn

I want to monitor the loss while training a multiclass GradientBoostingClassifier so I can tell whether it is overfitting. Here is my code:

%matplotlib inline
import numpy as np
#import matplotlib.pyplot as plt
import matplotlib.pylab as plt
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

n_est = 100
clf = GradientBoostingClassifier(n_estimators=n_est, max_depth=3, random_state=2)
clf.fit(X_train, y_train)


test_score = np.empty(len(clf.estimators_))
for i, pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, pred)
plt.plot(np.arange(n_est) + 1, test_score, label='Test')
plt.plot(np.arange(n_est) + 1, clf.train_score_, label='Train')
plt.show()

However, I get the following ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-27194f883893> in <module>()
     22 test_score = np.empty(len(clf.estimators_))
     23 for i, pred in enumerate(clf.staged_predict(X_test)):
---> 24     test_score[i] = clf.loss_(y_test, pred)
     25 plt.plot(np.arange(n_est) + 1, test_score, label='Test')
     26 plt.plot(np.arange(n_est) + 1, clf.train_score_, label='Train')

C:\Documents and Settings\Philippe\Anaconda\lib\site-packages\sklearn\ensemble\gradient_boosting.pyc in __call__(self, y, pred)
    396             Y[:, k] = y == k
    397 
--> 398         return np.sum(-1 * (Y * pred).sum(axis=1) +
    399                       logsumexp(pred, axis=1))
    400 

ValueError: operands could not be broadcast together with shapes (45,3) (45)

I know this code works fine with GradientBoostingRegressor, but I can't figure out how to make it work with a multiclass classifier such as GradientBoostingClassifier. Thanks for your help.

asked May 06 '14 15:05

user3329302

1 Answer

It seems that loss_ expects an array of shape [n_samples, k] (one column per class), whereas staged_predict returns an array of shape [n_samples] (as per the documentation). You probably want to pass the result of staged_predict_proba or staged_decision_function into loss_ instead.
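
To make the shape mismatch concrete, here is a minimal NumPy-only sketch that mirrors the expression shown in the traceback. The function name multinomial_deviance and the toy arrays are made up for illustration and are not scikit-learn's actual implementation. Because the expression applies logsumexp to pred, it appears to treat pred as raw per-class scores, which suggests staged_decision_function is the closer match of the two alternatives.

```python
import numpy as np

def multinomial_deviance(y, pred, n_classes):
    """Hypothetical re-creation of the traceback expression:
    sum over samples of (-score of the true class + logsumexp of all scores)."""
    # One-hot encode the labels, as lines 396-398 of the traceback do.
    Y = np.zeros((y.shape[0], n_classes))
    for k in range(n_classes):
        Y[:, k] = y == k
    # This product is where a 1-D `pred` fails to broadcast against (n, k).
    true_class_score = (Y * pred).sum(axis=1)
    # Numerically stable log-sum-exp over the class axis.
    m = pred.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(pred - m).sum(axis=1))
    return np.sum(-1 * true_class_score + lse)

y = np.array([0, 1, 2, 1, 0])
labels_1d = np.array([0, 1, 2, 2, 0])   # shaped like staged_predict output
scores_2d = np.zeros((5, 3))            # shaped like staged_decision_function output

loss_2d = multinomial_deviance(y, scores_2d, 3)  # works: shapes (5, 3) and (5, 3)

try:
    multinomial_deviance(y, labels_1d, 3)        # shapes (5, 3) and (5,)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
```

With all-zero scores every class is equally likely, so the loss is 5 * log(3); the 1-D input raises the same kind of broadcasting ValueError as in the question.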

I think you can measure the loss on both the train and test sets like so:

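# Assumes clf, the train/test splits, np and plt from the question's code.
test_score = np.empty(len(clf.estimators_))
train_score = np.empty(len(clf.estimators_))

# staged_decision_function yields one [n_samples, k] score array per stage
for i, pred in enumerate(clf.staged_decision_function(X_test)):
    test_score[i] = clf.loss_(y_test, pred)

for i, pred in enumerate(clf.staged_decision_function(X_train)):
    train_score[i] = clf.loss_(y_train, pred)

plt.plot(test_score)
plt.plot(train_score)
plt.legend(['test score', 'train score'])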

Note that the second time I call loss_, I pass in the training set. The output looks like what I would expect:

[Figure: test score and train score plotted against boosting iteration]
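
As far as I know, the loss_ attribute was deprecated and later removed in newer scikit-learn releases, so the loop above may not run on a current install. A rough alternative, sketched under that assumption, is to score each stage's class probabilities with sklearn.metrics.log_loss. The helper name staged_log_loss is hypothetical, and log_loss reports the mean per sample rather than the sum that loss_ returns, so the curves are scaled differently but have the same shape.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=2)
clf.fit(X_train, y_train)

def staged_log_loss(model, X, y):
    """Mean log loss after each boosting stage (hypothetical helper)."""
    # staged_predict_proba yields one (n_samples, n_classes) array per stage.
    return np.array([log_loss(y, proba, labels=model.classes_)
                     for proba in model.staged_predict_proba(X)])

train_loss = staged_log_loss(clf, X_train, y_train)
test_loss = staged_log_loss(clf, X_test, y_test)

# Stage (1-based) at which the held-out loss is lowest.
best_iter = int(np.argmin(test_loss)) + 1
```

Plotting train_loss and test_loss against the stage number gives the same kind of picture: a test curve that turns upward after best_iter while the train curve keeps falling is the usual sign of overfitting.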

answered Sep 21 '22 08:09

mbatchkarov
