Learning curve (high bias / high variance): why does the testing learning curve get flat?

I implemented a model using a gradient-boosted decision tree (GBDT) classifier and plotted learning curves for both the training and test sets to decide what to do next to improve my model. The result looks like this:

[Learning curve plot: the y axis is accuracy (the percentage of correct predictions) and the x axis is the number of samples used to train the model.]

I understand that the gap between the training and testing scores is probably due to high variance (overfitting). But the image also shows that the test score (the green line) increases very little as the number of samples grows from 2000 to 3000; the test-score curve is flattening out. The model is not getting better even with more samples.

My understanding is that a flat learning curve usually indicates high bias (underfitting). Is it possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?

Any help would be appreciated. Thanks in advance.

=====================================

The code I use is as follows. Basically, it is the same code as the example in the scikit-learn documentation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = GradientBoostingClassifier(n_estimators=450)
X, y = features, target  # features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)

plt.show()


1 Answer

I would say you are overfitting. Because you are using cross-validation, the gap between the training and cross-validation scores is too big to explain away: without cross-validation or random splitting, you could blame it on train and test sets that differ in some systematic way, but here it points to genuine variance.
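
A quick way to make that concrete is to compute the gap at each training size. A minimal sketch, reusing the estimator, X, y and cv objects from the question's code (what counts as "too big" is problem-dependent):

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, n_jobs=4)
# A gap that stays large while the CV score plateaus is the
# signature of high variance rather than high bias.
gap = np.mean(train_scores, axis=1) - np.mean(test_scores, axis=1)
for n, g in zip(train_sizes, gap):
    print(f"{n:5d} samples: train/CV gap = {g:.3f}")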

There are a couple of ways you could try to mitigate this:

  • Add more data (the training score will probably still come down a little further)
  • Reduce the number of estimators, or even better, use early stopping (see the sketch after this list)
  • Increase gamma for pruning (gamma is an XGBoost parameter; the closest scikit-learn analogues are min_impurity_decrease or min_samples_leaf)
  • Use subsampling (by tree, by column...)
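
For the early-stopping and subsampling items, here is a minimal sketch using scikit-learn's GradientBoostingClassifier; all parameter values are illustrative assumptions, not tuned recommendations:

# Sketch: GBDT with early stopping and row/column subsampling.
# Every value below is an illustrative assumption.
from sklearn.ensemble import GradientBoostingClassifier

estimator = GradientBoostingClassifier(
    n_estimators=450,          # upper bound; early stopping may use fewer
    validation_fraction=0.1,   # hold out 10% of the training data internally
    n_iter_no_change=10,       # stop once the validation score stalls
    subsample=0.8,             # row subsampling per tree (stochastic GB)
    max_features=0.8,          # column subsampling per split
    min_samples_leaf=20,       # coarser trees, similar in spirit to
                               # gamma-style pruning in XGBoost
)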

There are lots of parameters that you can play with, so have some fun! :-D



