
How can I implement incremental training for xgboost?

The problem is that my training data cannot fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, computes the residuals, builds the next tree, and so on (the way gradient boosted trees are built). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in a loop, it will not help, because in that case it just rebuilds the whole model for each batch.
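
To be concrete, the loop I mean is roughly the following (iterate_batches() here is just a stand-in for whatever reads one RAM-sized chunk of the training data at a time):

import xgboost as xgb

param = {'objective': 'reg:linear'}  # example parameters

for X_batch, y_batch in iterate_batches():   # placeholder batch loader
    batch_dtrain = xgb.DMatrix(X_batch, label=y_batch)
    # This starts a brand new booster on every iteration, so the final
    # `model` has only ever seen the last batch.
    model = xgb.train(param, batch_dtrain, 2)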

Marat Zakirov asked Jun 28 '16


1 Answer

Try saving your model after you train on the first batch. Then, on successive runs, pass the filepath of the saved model to xgb.train via its xgb_model parameter.
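
In other words, for the original many-batches setting, the pattern would look roughly like this (untested sketch; iterate_batches() is a placeholder for however you stream the data from disk):

import xgboost as xgb

params = {'objective': 'reg:linear', 'verbose': False}

model_path = None                                # no saved model yet
for X_batch, y_batch in iterate_batches():       # placeholder batch loader
    dbatch = xgb.DMatrix(X_batch, label=y_batch)
    # xgb_model=None trains from scratch; a filepath makes xgb.train
    # continue boosting from the trees stored in that file
    booster = xgb.train(params, dbatch, 10, xgb_model=model_path)
    booster.save_model('incremental.model')
    model_path = 'incremental.model'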

Here's a small experiment that I ran to convince myself that it works:

First, split the boston dataset into training and testing sets. Then split the training set into halves. Fit a model on the first half and get a score that will serve as a benchmark. Then fit two models on the second half; one model will receive the additional parameter xgb_model. If passing in the extra parameter made no difference, we would expect their scores to be similar. But, fortunately, the model that continues from the saved one seems to perform much better than the one trained from scratch.

import xgboost as xgb
from sklearn.cross_validation import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
                                                     y_train,
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py

Alain answered Sep 23 '22