How to stop gradient boosting machine from overfitting?

I am comparing several models (gradient boosting machine, random forest, logistic regression, SVM, multilayer perceptron, and a Keras neural network) on a multiclass classification problem. I have used nested cross-validation and grid search on my models, running them on my actual data and also on randomised data to check for overfitting. However, no matter how I change my data or model parameters, the gradient boosting machine gives me 100% accuracy on the random data every time. Is there something in my code that could be causing this?

Here is my code:

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn import model_selection, preprocessing
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], axis=1)
df = data.iloc[:, 0:26]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])  # unused: the next line refits the encoder
Y = le.fit_transform(data["category"])

sm = SMOTE(random_state=100)  # oversample minority classes
X_res, y_res = sm.fit_resample(X, Y)

seed = 7
logreg = LogisticRegression(penalty='l1', solver='liblinear',multi_class='auto')
LR_par= {'penalty':['l1'], 'C': [0.5, 1, 5, 10], 'max_iter':[100, 200, 500, 1000]}

rfc = RandomForestClassifier(n_estimators=500)
param_grid = {"max_depth": [3],
              "max_features": ["auto"],
              "min_samples_split": [2],
              "min_samples_leaf": [1],
              "bootstrap": [False],
              "criterion": ["entropy", "gini"]}


mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(50, 50, 50)],
                   'activation': ['relu'],
                   'solver': ['adam'],
                   'max_iter': [10000],
                   'alpha': [0.0001],
                   'learning_rate': ['constant']}

gbm = GradientBoostingClassifier()
param = {"loss":["deviance"],
    "learning_rate": [0.001],
    "min_samples_split": [2],
    "min_samples_leaf": [1],
    "max_depth":[3],
    "max_features":["auto"],
    "criterion": ["friedman_mse"],
    "n_estimators":[50]
    }

svm = SVC(gamma="scale")
tuned_parameters = {'kernel':('linear', 'rbf'), 'C':(1,0.25,0.5,0.75)}

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)


def baseline_model():

    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))  # Dense layer: output = activation(dot(input, kernel) + bias)
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units
    model.add(Dense(4, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

models = []

models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', KerasClassifier(build_fn=baseline_model, epochs=100, batch_size=50, verbose=0)))

results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)


for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100),  '%')

Output:

Nested CV Accuracy GBM: 90.952381 (+/- 2.776644 )
Test set accuracy: 90.48 %
Nested CV Accuracy RFC: 79.285714 (+/- 5.112122 )
Test set accuracy: 75.00 %
Nested CV Accuracy LR: 91.904762 (+/- 4.416009 )
Test set accuracy: 92.86 %
Nested CV Accuracy SVM: 94.285714 (+/- 3.563483 )
Test set accuracy: 96.43 %
Nested CV Accuracy MLP: 91.428571 (+/- 4.012452 )
Test set accuracy: 92.86 %

Random data code:

ran = np.random.randint(4, size=161)                  # 161 random labels in {0, 1, 2, 3}
random = np.random.normal(500, 100, size=(161, 161))  # 161 x 161 random features
rand = np.column_stack((random, ran))                 # features with the label appended as the last column
print(rand.shape)
X1 = rand[:161]   # row slice: keeps all 161 rows and every column of rand, including the label column
Y1 = rand[:, -1]  # the label column
print("Random data counts of label '1': {}".format(sum(ran==1)))
print("Random data counts of label '0': {}".format(sum(ran==0)))
print("Random data counts of label '2': {}".format(sum(ran==2)))
print("Random data counts of label '3': {}".format(sum(ran==3)))

for name, model in models:
    cv_results = model_selection.cross_val_score(model, X1, Y1,  cv=outer_cv, scoring=scoring)
    names.append(name)
    msg = "Random data CV %s: %f (+/- %f)" % (name, cv_results.mean()*100, cv_results.std()*100)
    print(msg)

Random data output:

Random data CV GBM: 100.000000 (+/- 0.000000)
Random data CV RFC: 62.941176 (+/- 15.306485)
Random data CV LR: 23.566176 (+/- 6.546699)
Random data CV SVM: 22.352941 (+/- 6.331220)
Random data CV MLP: 23.639706 (+/- 7.371392)
Random data CV Keras: 22.352941 (+/- 8.896451)

The gradient boosting classifier (GBM) stays at 100% whether I reduce the number of features or change the parameters in the grid search (I normally put in multiple parameter values, but that can run for hours without finishing, so I have left that problem for now), and the result is the same if I try binary classification data.

The random forest (RFC) is also suspiciously high at 62%. Is there something I am doing wrong?

The data I am using consists predominantly of binary features and looks like the example below (I am predicting the Category column):

gene   Tissue    Druggable Eigenvalue CADDvalue Catalogpresence   Category
ACE      1           1         1          0           1            Certain
ABO      1           0         0          0           0            Likely
TP53     1           1         0          0           0            Possible

Any guidance would be appreciated.

Asked Apr 09 '19 by DN1



1 Answer

In general, there are a few parameters you can play with to reduce overfitting. The conceptually easiest are min_samples_split and min_samples_leaf: setting higher values for these stops the model from memorizing how to correctly identify a single piece of data or very small groups of data. For a large data set (~1 million rows), I would place these values at around 50, if not higher. You can run a grid search to find values that work well for your specific data.
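As a sketch, such a grid might look like this (the values are illustrative assumptions, not tuned recommendations, and the setup mirrors the gbm/inner_cv pattern from the question):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Illustrative grid: larger split/leaf sizes stop individual trees from
# isolating single samples or tiny groups. The exact values are assumptions.
param = {
    "min_samples_split": [10, 50, 100],  # a node needs this many samples before it may split
    "min_samples_leaf": [5, 25, 50],     # every leaf must keep at least this many samples
    "max_depth": [2, 3],                 # shallower trees regularize further
}

inner_cv = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(GradientBoostingClassifier(n_estimators=50), param, cv=inner_cv, n_jobs=1)
# grid.fit(X_res, y_res)  # then grid.best_params_ shows which values generalize best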

You can also use subsample and max_features to reduce overfitting. These parameters essentially hide part of the data from each tree, which keeps the model from memorizing it.
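For instance, a minimal sketch (the 0.8/0.5 fractions are assumptions to adapt, and setting subsample below 1.0 turns this into stochastic gradient boosting):

from sklearn.ensemble import GradientBoostingClassifier

# Each tree is fit on a random 80% of the rows, and each split considers
# only a random 50% of the features, so no single tree sees all the data.
gbm = GradientBoostingClassifier(subsample=0.8,
                                 max_features=0.5,
                                 min_samples_split=50,
                                 min_samples_leaf=25)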

Answered Oct 13 '22 by sonia