Learning curve (high bias / high variance): why does the testing learning curve get flat?

I implemented a model using a gradient-boosted decision tree (GBDT) classifier and plotted learning curves for both the training and test sets to decide what to do next to improve my model. The result looks like this:

[Learning curve plot: the y axis is accuracy (the percentage of correct predictions) and the x axis is the number of samples used to train the model.]

I understand that the gap between the training and testing scores is probably due to high variance (overfitting). But the image also shows that the test score (the green line) increases very little as the number of samples grows from 2000 to 3000; the test-score curve is flattening out. The model is not getting better even with more samples.

My understanding is that a flat learning curve usually indicates high bias (underfitting). Is it possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?

Any help would be appreciated. Thanks in advance.

=====================================

The code I use is as follows. Basically, it is the same code as the example in the scikit-learn documentation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = GradientBoostingClassifier(n_estimators=450)
X, y = features, target  # features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)

plt.show()


1 Answer

I would say you are overfitting. Because you are using cross-validation, the gap between the training and cross-validation scores is too big to explain away: without cross-validation or random splitting, you could blame it on train and test sets that differ in some systematic way, but here it points to genuine variance.
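
A quick way to make that concrete is to compute the gap at each training size. A minimal sketch, reusing the estimator, X, y and cv objects from the question's code (what counts as "too big" is problem-dependent):

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, n_jobs=4)
# A gap that stays large while the CV score plateaus is the
# signature of high variance rather than high bias.
gap = np.mean(train_scores, axis=1) - np.mean(test_scores, axis=1)
for n, g in zip(train_sizes, gap):
    print(f"{n:5d} samples: train/CV gap = {g:.3f}")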

There are a couple of ways you could try to mitigate this:

  • Add more data (the training score will probably still come down a little further)
  • Reduce the number of estimators, or even better, use early stopping (see the sketch after this list)
  • Increase gamma for pruning (gamma is an XGBoost parameter; the closest scikit-learn analogues are min_impurity_decrease or min_samples_leaf)
  • Use subsampling (by tree, by column...)
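
For the early-stopping and subsampling items, here is a minimal sketch using scikit-learn's GradientBoostingClassifier; all parameter values are illustrative assumptions, not tuned recommendations:

# Sketch: GBDT with early stopping and row/column subsampling.
# Every value below is an illustrative assumption.
from sklearn.ensemble import GradientBoostingClassifier

estimator = GradientBoostingClassifier(
    n_estimators=450,          # upper bound; early stopping may use fewer
    validation_fraction=0.1,   # hold out 10% of the training data internally
    n_iter_no_change=10,       # stop once the validation score stalls
    subsample=0.8,             # row subsampling per tree (stochastic GB)
    max_features=0.8,          # column subsampling per split
    min_samples_leaf=20,       # coarser trees, similar in spirit to
                               # gamma-style pruning in XGBoost
)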

There are lots of parameters that you can play with, so have some fun! :-D



