After running AutoML from the H2O Python module, I found that XGBoost sits on top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate them with the XGBoost sklearn API. However, the performance differs between the two approaches:
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report
import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd
h2o.init()
iris = datasets.load_iris()
X = iris.data
y = iris.target
data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1))
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)  # shuffle the rows
# data.shape
train_df = data[:120]
test_df = data[120:]
# Import the train/test sets into H2O
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)
# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)
# For classification, the response should be encoded as a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
aml = H2OAutoML(max_models=10, seed=1, nfolds=3,
                keep_cross_validation_predictions=True,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
skplt.metrics.plot_confusion_matrix(test_df['target'],
                                    m.predict(test).as_data_frame()['predict'],
                                    normalize=False)
# sklearn parameter name -> H2O parameter name
mapping_dict = {
    "booster": "booster",
    "colsample_bylevel": "col_sample_rate",
    "colsample_bytree": "col_sample_rate_per_tree",
    "gamma": "min_split_improvement",
    "learning_rate": "learn_rate",
    "max_delta_step": "max_delta_step",
    "max_depth": "max_depth",
    "min_child_weight": "min_rows",
    "n_estimators": "ntrees",
    "nthread": "nthread",
    "reg_alpha": "reg_alpha",
    "reg_lambda": "reg_lambda",
    "subsample": "sample_rate",
    "seed": "seed",
    # "max_delta_step": "score_tree_interval",
    # 'missing': None,
    # 'objective': 'binary:logistic',
    # 'scale_pos_weight': 1,
    # 'silent': 1,
    # 'base_score': 0.5,
}
parameter_from_water = {}
for sk_name, h2o_name in mapping_dict.items():
    # copy the value H2O actually used into the sklearn parameter name
    parameter_from_water[sk_name] = m.params[h2o_name]['actual']
# parameter_from_water
xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
skplt.metrics.plot_confusion_matrix(test_df['target'],
                                    xgb_clf.predict(test_df.drop('target', axis=1)),
                                    normalize=False)
Anything obvious that I missed?
When you use H2O AutoML with the following lines of code:
aml = H2OAutoML(max_models=10, seed=1, nfolds=3,
                keep_cross_validation_predictions=True,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
you use the option nfolds = 3, which means each algorithm is trained three times, each time using two thirds of the data for training and the remaining third for validation. This makes the evaluation more stable and can sometimes give better performance than handing the algorithm your entire training dataset in one go.
Handing over the whole training set at once is exactly what you do when you train your XGBoost with fit(). So even though you have the same algorithm (XGBoost) with the same hyperparameters, you don't use the training set the same way H2O does. Hence the difference in your confusion matrices!
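As a quick sanity check before changing anything on the H2O side, you can score the sklearn clone with a comparable 3-fold scheme using the cross_val_predict that is already imported in your question (a rough sketch only: the fold assignments will still differ from H2O's internal splits):

cv_preds = cross_val_predict(xgb_clf,
                             train_df.drop('target', axis=1),
                             train_df['target'],
                             cv=3)  # mirrors nfolds = 3, but the folds differ from H2O's
skplt.metrics.plot_confusion_matrix(train_df['target'], cv_preds, normalize=False)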
If you want the same performance when copying the best model, you can change the parameter to H2OAutoML(..., nfolds = 0).
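A minimal sketch of that change (assumption: with cross-validation disabled you also drop keep_cross_validation_predictions, and depending on your H2O version you may want to pass a leaderboard_frame for model ranking):

aml = H2OAutoML(max_models=10, seed=1, nfolds=0,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)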
Furthermore, H2O's XGBoost takes roughly 60 parameters into account, and your mapping dictionary only copies over a subset of them (a few, like objective and scale_pos_weight, are even commented out). So your xgboost is not exactly the same as your H2O model, which could also explain the differences in performance.
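To see which H2O parameters your dictionary silently leaves at the sklearn defaults, you can diff the two sides (a small sketch built on the same m.params dictionary your code already reads from):

# parameters H2O actually used but that mapping_dict does not carry over
covered = set(mapping_dict.values())
unmapped = {name: p['actual'] for name, p in m.params.items()
            if name not in covered}
print(sorted(unmapped.items()))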