The problem is that my training data is too large to fit into RAM. So I need a method that first builds one tree on the whole training set, calculates the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, calling model = xgb.train(param, batch_dtrain, 2) in a loop will not help, because it just rebuilds the whole model for each batch.
It took 30 minutes to train the model with no parameter tuning. If I run GridSearchCV with 3 folds and 6 learning rate values, it will take more than 10 hours to return.
Incremental training saves both time and resources. Use it to train a new model on an expanded dataset that contains an underlying pattern the previous training did not account for, and whose absence resulted in poor model performance.
Try saving your model after you train on the first batch. Then, on successive runs, pass the file path of the saved model to xgb.train through its xgb_model parameter.
Here's a small experiment that I ran to convince myself that it works:
First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one of them gets the additional parameter xgb_model, which points at the model saved from the first half. If passing in the extra parameter didn't make a difference, we would expect the two scores to be similar. Fortunately, the model that continues from the saved one scores much better than the one trained from scratch on the second half.
import xgboost as xgb
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split as ttsplit
# note: load_boston was removed in scikit-learn 1.2, so this needs an older release
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
X = boston['data']
y = boston['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
                                                     y_train,
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

# 'reg:squarederror' is the current name of the old 'reg:linear' objective
params = {'objective': 'reg:squarederror', 'verbosity': 0}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)                            # from scratch
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model') # continue from model_1

print(mse(y_test, model_1.predict(xg_test)))     # benchmark
print(mse(y_test, model_2_v1.predict(xg_test)))  # "before"
print(mse(y_test, model_2_v2.predict(xg_test)))  # "after"

# scores from the original run:
# 23.0475232194
# 39.6776876084
# 27.2053239482

reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py