After running AutoML from the H2O Python module, I found that XGBoost sits on top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate them with the XGBoost sklearn API. However, the performance differs between the two approaches:
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report
import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd
h2o.init()
iris = datasets.load_iris()
X = iris.data
y = iris.target
data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1))
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)  # shuffle the rows
# data.shape
train_df = data[:120]
test_df = data[120:]
# Import the train/test sets into H2O
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)
# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)
# For classification, the response should be encoded as a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
aml = H2OAutoML(max_models=10, seed=1, nfolds=3,
                keep_cross_validation_predictions=True,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
skplt.metrics.plot_confusion_matrix(test_df['target'],
                                    m.predict(test).as_data_frame()['predict'],
                                    normalize=False)
# sklearn parameter name -> H2O parameter name
mapping_dict = {
    "booster": "booster",
    "colsample_bylevel": "col_sample_rate",
    "colsample_bytree": "col_sample_rate_per_tree",
    "gamma": "min_split_improvement",
    "learning_rate": "learn_rate",
    "max_delta_step": "max_delta_step",
    "max_depth": "max_depth",
    "min_child_weight": "min_rows",
    "n_estimators": "ntrees",
    "nthread": "nthread",
    "reg_alpha": "reg_alpha",
    "reg_lambda": "reg_lambda",
    "subsample": "sample_rate",
    "seed": "seed",
    # "max_delta_step": "score_tree_interval",
    # 'missing': None,
    # 'objective': 'binary:logistic',
    # 'scale_pos_weight': 1,
    # 'silent': 1,
    # 'base_score': 0.5,
}
parameter_from_water = {}
for sk_name, h2o_name in mapping_dict.items():
    # copy the value H2O actually used into the sklearn parameter name
    parameter_from_water[sk_name] = m.params[h2o_name]['actual']
# parameter_from_water
xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
skplt.metrics.plot_confusion_matrix(test_df['target'],
                                    xgb_clf.predict(test_df.drop('target', axis=1)),
                                    normalize=False)
Anything obvious that I missed?
When you use H2O AutoML with the following lines of code:
aml = H2OAutoML(max_models=10, seed=1, nfolds=3,
                keep_cross_validation_predictions=True,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
you use the option nfolds = 3, which means each algorithm is trained three times, each time using two thirds of the data for training and the remaining third for validation. This makes the evaluation more stable and can sometimes give better performance than handing the algorithm your entire training dataset in one go.
Handing over the whole training set at once is exactly what you do when you train your XGBoost with fit(). So even though you have the same algorithm (XGBoost) with the same hyperparameters, you don't use the training set the same way H2O does. Hence the difference in your confusion matrices!
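As a quick sanity check before changing anything on the H2O side, you can score the sklearn clone with a comparable 3-fold scheme using the cross_val_predict that is already imported in your question (a rough sketch only: the fold assignments will still differ from H2O's internal splits):

cv_preds = cross_val_predict(xgb_clf,
                             train_df.drop('target', axis=1),
                             train_df['target'],
                             cv=3)  # mirrors nfolds = 3, but the folds differ from H2O's
skplt.metrics.plot_confusion_matrix(train_df['target'], cv_preds, normalize=False)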
If you want the same performance when copying the best model, you can change the parameter to H2OAutoML(..., nfolds = 0).
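A minimal sketch of that change (assumption: with cross-validation disabled you also drop keep_cross_validation_predictions, and depending on your H2O version you may want to pass a leaderboard_frame for model ranking):

aml = H2OAutoML(max_models=10, seed=1, nfolds=0,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)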
Furthermore, H2O's XGBoost takes roughly 60 parameters into account, and your mapping dictionary only copies over a subset of them (a few, like objective and scale_pos_weight, are even commented out). So your xgboost is not exactly the same as your H2O model, which could also explain the differences in performance.
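To see which H2O parameters your dictionary silently leaves at the sklearn defaults, you can diff the two sides (a small sketch built on the same m.params dictionary your code already reads from):

# parameters H2O actually used but that mapping_dict does not carry over
covered = set(mapping_dict.values())
unmapped = {name: p['actual'] for name, p in m.params.items()
            if name not in covered}
print(sorted(unmapped.items()))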