
Using Hyper-parameters from H2O to Re-build XGBoost in Sklearn Gives Different Performance in Python

After running AutoML from the H2O Python module, I found that XGBoost was at the top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate them with the XGBoost Sklearn API. However, the performance differs between the two approaches:


from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report

import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd

h2o.init()


iris = datasets.load_iris()
X = iris.data
y = iris.target

data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1)) 
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)
# data.shape

train_df = data[:120]
test_df = data[120:]

# Import the train/test sets into H2O
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)

# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)

# For classification, the response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
                keep_cross_validation_predictions=True,
                exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
1. Performance of the H2O XGBoost:
skplt.metrics.plot_confusion_matrix(test_df['target'], 
                                    m.predict(test).as_data_frame()['predict'], 
                                    normalize=False)

[confusion matrix plot: H2O XGBoost predictions vs. test_df['target']]

2. Replicate in the XGBoost Sklearn API:
mapping_dict = {
        "booster": "booster",
        "colsample_bylevel": "col_sample_rate",
        "colsample_bytree": "col_sample_rate_per_tree",
        "gamma": "min_split_improvement",
        "learning_rate": "learn_rate",
        "max_delta_step": "max_delta_step",
        "max_depth": "max_depth",
        "min_child_weight": "min_rows",
        "n_estimators": "ntrees",
        "nthread": "nthread",
        "reg_alpha": "reg_alpha",
        "reg_lambda": "reg_lambda",
        "subsample": "sample_rate",
        "seed": "seed",

        # "max_delta_step": "score_tree_interval",
        #  'missing': None,
        #  'objective': 'binary:logistic',
        #  'scale_pos_weight': 1,
        #  'silent': 1,
        #  'base_score': 0.5,
}

parameter_from_water = {}
for sk_name, h2o_name in mapping_dict.items():
    parameter_from_water[sk_name] = m.params[h2o_name]['actual']
# parameter_from_water

xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
3. Performance of the Sklearn XGBoost (always worse than H2O in all examples I tried):
skplt.metrics.plot_confusion_matrix(test_df['target'], 
                                    xgb_clf.predict(test_df.drop('target', axis=1)), 
                                    normalize=False)

[confusion matrix plot: Sklearn XGBoost predictions vs. test_df['target']]

Anything obvious that I missed?

asked Jun 17 '19 by B. Sun



1 Answer

When you use H2O AutoML with the following lines of code:

aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
                keep_cross_validation_predictions=True,
                exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)

you use the option nfolds=3, which means each algorithm is trained three times, each time using two thirds of the data for training and one third for validation. This makes the algorithm more stable and can sometimes give better performance than feeding it the entire training set in one go.

Feeding the entire training set in one go is exactly what you do when you train your XGBoost with fit(). So even though you have the same algorithm (XGBoost) with the same hyper-parameters, you are not using the training set the way H2O does. Hence the difference in your confusion matrices!
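If you want to judge the scikit-learn model under a comparable resampling scheme, a rough sketch (not part of the original answer) is to get 3-fold out-of-fold predictions with cross_val_predict, which the question already imports. This only mimics H2O's nfolds=3 evaluation, not its exact CV training procedure:

# Rough sketch: 3-fold out-of-fold predictions for the scikit-learn model,
# reusing xgb_clf and train_df from the question above.
oof_preds = cross_val_predict(xgb_clf,
                              train_df.drop('target', axis=1),
                              train_df['target'],
                              cv=3)
print(classification_report(train_df['target'], oof_preds))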

If you want to get the same performance when copying the best model, you can set nfolds=0 in H2OAutoML(...) so that no cross-validation is used.
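A minimal sketch of that change, assuming your H2O version accepts nfolds=0 (keep_cross_validation_predictions is dropped because there are no CV predictions to keep):

# Rerun AutoML without cross-validation so the leader model is fitted on the
# training frame in a single pass, like the scikit-learn fit() call.
aml_no_cv = H2OAutoML(max_models=10, seed=1, nfolds=0,
                      exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml_no_cv.train(x=x, y=y, training_frame=train)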


Furthermore, H2O's XGBoost takes roughly 60 different parameters into account, and your mapping dictionary misses a few important ones, such as min_child_weight. So your XGBoost is not configured exactly like the H2O one, which could also explain the difference in performance.
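To spot what the mapping misses, one option (a sketch, not from the original answer) is to list the H2O parameters whose actual values differ from their defaults but are not covered by mapping_dict:

# Sketch: print H2O parameters that were changed from their defaults but are
# not translated into the scikit-learn configuration.
covered = set(mapping_dict.values())
for name, p in m.params.items():
    if name not in covered and p.get('actual') != p.get('default'):
        print(name, '->', p.get('actual'))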

answered Sep 23 '22 by vlemaistre