I would like to use the xgboost cv function to find the best parameters for my training data set, but I am confused by the API. How do I find the best parameters? Is this similar to sklearn's grid_search cross-validation function? How can I find which of the options for the max_depth parameter ([2,4,6]) was determined to be optimal?
from sklearn.datasets import load_iris
import xgboost as xgb
iris = load_iris()
DTrain = xgb.DMatrix(iris.data, iris.target)
x_parameters = {"max_depth":[2,4,6]}
xgb.cv(x_parameters, DTrain)
...
Out[6]:
test-rmse-mean test-rmse-std train-rmse-mean train-rmse-std
0 0.888435 0.059403 0.888052 0.022942
1 0.854170 0.053118 0.851958 0.017982
2 0.837200 0.046986 0.833532 0.015613
3 0.829001 0.041960 0.824270 0.014501
4 0.825132 0.038176 0.819654 0.013975
5 0.823357 0.035454 0.817363 0.013722
6 0.822580 0.033540 0.816229 0.013598
7 0.822265 0.032209 0.815667 0.013538
8 0.822158 0.031287 0.815390 0.013508
9 0.822140 0.030647 0.815252 0.013494
XGBoost has a very useful function called “cv” which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees.
Another way to perform cross-validation with XGBoost is to use XGBoost's own non-Scikit-learn-compatible API. “Non-Scikit-learn compatible” means that here we do not use the Scikit-learn cross_val_score() function; instead, we use XGBoost's cv() function with explicitly created DMatrices.
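Note that xgb.cv() evaluates a single, fixed parameter dict per call; unlike GridSearchCV, it will not sweep a list such as [2,4,6] for you. Here is a minimal sketch of doing that sweep by hand with explicit DMatrices; the eta, objective, fold count, and round count are illustrative assumptions, not values from the question:

from sklearn.datasets import load_iris
import xgboost as xgb

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, label=iris.target)

best_depth, best_rmse = None, float("inf")
for depth in [2, 4, 6]:
    # One xgb.cv call per candidate value; all other parameters stay fixed.
    params = {"max_depth": depth, "eta": 0.3, "objective": "reg:squarederror"}
    cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
                        early_stopping_rounds=10)  # stop once test RMSE plateaus
    rmse = cv_results["test-rmse-mean"].min()
    if rmse < best_rmse:
        best_depth, best_rmse = depth, rmse

print(best_depth, best_rmse)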
Wide variety of tuning parameters: XGBoost internally has parameters for cross-validation, regularization, user-defined objective functions, missing values, tree parameters, a scikit-learn-compatible API, etc.
nrounds: the number of decision trees in the final model. objective: the training objective to use, where “binary:logistic” means a binary classifier.
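For reference, here is how those two settings look in XGBoost's Python API (nrounds is the R-package name; the Python cv()/train() functions call it num_boost_round). The data below is random placeholder data:

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)  # binary labels for binary:logistic
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 4}
booster = xgb.train(params, dtrain, num_boost_round=20)  # 20 boosting rounds, i.e. 20 trees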
You can use GridSearchCV with XGBoost through the XGBoost sklearn API.
Define your classifier as follows:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in modern scikit-learn
xgb_model = XGBClassifier(**other_params)  # other_params: a dict of any fixed keyword arguments
test_params = {
    'max_depth': [4, 8, 12]
}
model = GridSearchCV(estimator=xgb_model, param_grid=test_params)
model.fit(train, target)
print(model.best_params_)
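After fitting, model.best_params_ reports which max_depth from the grid performed best, and model.best_score_ gives the corresponding mean cross-validated score.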
Cross-validation is used for estimating the performance of one set of parameters on unseen data.
Grid-search evaluates a model with varying parameters to find the best possible combination of these.
The sklearn docs talk a lot about CV, and the two can be used in combination, but each has a very different purpose. You might be able to fit xgboost into sklearn's grid-search functionality; check out the sklearn interface to xgboost for the smoothest integration, as in the sketch below.
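To make the distinction concrete, here is a small sketch using the sklearn-compatible interface; the cv=5 fold count and the candidate values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

# Cross-validation: estimate performance of ONE parameter setting on held-out folds.
scores = cross_val_score(XGBClassifier(max_depth=4), X, y, cv=5)
print("CV accuracy for max_depth=4:", scores.mean())

# Grid search: run that cross-validation for EVERY candidate setting and keep the best.
search = GridSearchCV(XGBClassifier(), param_grid={"max_depth": [2, 4, 6]}, cv=5)
search.fit(X, y)
print("Best setting:", search.best_params_)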