The problem is that my training data cannot fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, calculates residuals, builds another tree, and so on (the way gradient boosted trees work). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in a loop, it will not help, because in that case it just rebuilds the whole model for each batch.
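For concreteness, here is a minimal sketch of that naive loop (get_batches is a hypothetical generator over chunks that fit in RAM); each call to xgb.train starts from scratch, so only the trees fit on the last batch survive:

import xgboost as xgb

param = {'objective': 'reg:squarederror'}

# get_batches() is a hypothetical generator yielding (X_batch, y_batch) chunks
for X_batch, y_batch in get_batches():
    batch_dtrain = xgb.DMatrix(X_batch, label=y_batch)
    # retrains from scratch on every iteration; 'model' is simply overwritten
    model = xgb.train(param, batch_dtrain, 2)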
It took 30 minutes to train the model with no parameter tuning. If I run GridSearchCV with 3 folds and 6 learning-rate values, it will take more than 10 hours to return.
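For context on that estimate: 6 learning-rate candidates evaluated with 3-fold cross-validation means 18 full training runs before the final refit, so at roughly 30 minutes each the total easily exceeds 9-10 hours. A minimal sketch of such a search using the scikit-learn wrapper, where the candidate values are hypothetical:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# 6 learning-rate candidates x 3 folds = 18 fits, plus one refit on the full training set
param_grid = {'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.2, 0.3]}  # hypothetical grid

search = GridSearchCV(
    estimator=xgb.XGBRegressor(n_estimators=100),
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
# search.fit(X_train, y_train)  # each candidate/fold pair is a full training run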
Incremental training saves both time and resources. Use it, for example, to train a new model on an expanded dataset that contains an underlying pattern the previous training did not account for, which resulted in poor model performance.
Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
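As a rough illustration of what this looks like in a loop over batches, here is a minimal sketch; iter_batches and the file name booster.model are hypothetical stand-ins for however the data is streamed from disk:

import xgboost as xgb

params = {'objective': 'reg:squarederror'}
model_path = None  # no saved model yet, so the first batch trains from scratch

# iter_batches() is a hypothetical generator yielding (X_batch, y_batch) chunks from disk
for X_batch, y_batch in iter_batches():
    dtrain = xgb.DMatrix(X_batch, label=y_batch)
    # when xgb_model points at a saved booster, xgb.train loads it and
    # appends num_boost_round more trees instead of starting over
    booster = xgb.train(params, dtrain, num_boost_round=30, xgb_model=model_path)
    booster.save_model('booster.model')
    model_path = 'booster.model'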
Here's a small experiment that I ran to convince myself that it works:
First, split the Boston dataset into training and testing sets, then split the training set in half. Fit a model on the first half and get a score that will serve as a benchmark. Then fit two models on the second half; one of them receives the additional parameter xgb_model. If passing in the extra parameter made no difference, we would expect their scores to be similar. Fortunately, the model that continues from the saved model performs much better than the one trained from scratch on the second half.
import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, y_train, test_size=0.5, random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}

model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482
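As a side note, the xgb_model argument of xgb.train also accepts an in-memory Booster, so saving to disk between steps is optional when everything happens in one process. A small sketch continuing the example above (model_2_v3 is just an illustrative name):

# continue boosting from the in-memory booster instead of the saved file;
# 30 more trees are appended to the ones already in model_1
model_2_v3 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
print(mse(model_2_v3.predict(xg_test), y_test))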
reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py