I trained H2O AutoML and got a leader model with satisfying metrics. I want to retrain the model periodically, but without using a checkpoint. So I guess all I need are the best parameters of the leader model, so that I can run it manually. I know about automlmodels.leader.params, but it gives a list of all the parameters tried. How can I get the best ones, as found in the leaderboard?
Here's a solution using the example from the H2O AutoML User Guide. The parameters for any model are stored in the model.params location. So if you want to grab the parameters for the leader model, you can access them here: aml.leader.params. If you wanted another model, you would grab that model into an object in Python using the h2o.get_model() function and, similarly, access the params using .params.
The .params object is a dictionary which stores all the parameter values (default and actual).
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)
The top of the leaderboard looks like this:
In [3]: aml.leaderboard
Out[3]:
model_id                                                  auc    logloss    mean_per_class_error      rmse       mse
---------------------------------------------------  --------  ---------  ----------------------  --------  --------
StackedEnsemble_AllModels_AutoML_20190309_152507     0.788879   0.552328                0.315963  0.432607  0.187149
StackedEnsemble_BestOfFamily_AutoML_20190309_152507  0.787642   0.553538                0.317995  0.433144  0.187614
XGBoost_1_AutoML_20190309_152507                     0.785199   0.557134                0.327844  0.434681  0.188948
XGBoost_grid_1_AutoML_20190309_152507_model_4        0.783523   0.557854                0.318819  0.435249  0.189441
XGBoost_grid_1_AutoML_20190309_152507_model_3        0.783004   0.559613                0.325081  0.435708  0.189841
XGBoost_2_AutoML_20190309_152507                     0.782186   0.558342                0.335769  0.435571  0.189722
XGBoost_3_AutoML_20190309_152507                       0.7815    0.55952                0.319151  0.436034  0.190126
GBM_5_AutoML_20190309_152507                         0.780837   0.559903                0.340848  0.436191  0.190263
GBM_2_AutoML_20190309_152507                         0.780036   0.559806                0.339926  0.436415  0.190458
GBM_1_AutoML_20190309_152507                         0.779827   0.560857                0.335096  0.436616  0.190633
[22 rows x 6 columns]
Here the leader is a Stacked Ensemble. We can look at the parameter names like this:
In [6]: aml.leader.params.keys()
Out[6]: dict_keys(['model_id', 'training_frame', 'response_column', 'validation_frame', 'base_models', 'metalearner_algorithm', 'metalearner_nfolds', 'metalearner_fold_assignment', 'metalearner_fold_column', 'keep_levelone_frame', 'metalearner_params', 'seed', 'export_checkpoints_dir'])
In [7]: aml.leader.params['metalearner_algorithm']
Out[7]: {'default': 'AUTO', 'actual': 'AUTO'}
If you are interested in the GLM (as you mentioned above), then you can grab it like this and examine the hyperparameter values.
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the GLM model
m = h2o.get_model([mid for mid in model_ids if "GLM" in mid][0])
Now look at the parameter names and then check out the default and actual values:
In [11]: m.params.keys()
Out[11]: dict_keys(['model_id', 'training_frame', 'validation_frame', 'nfolds', 'seed', 'keep_cross_validation_models', 'keep_cross_validation_predictions', 'keep_cross_validation_fold_assignment', 'fold_assignment', 'fold_column', 'response_column', 'ignored_columns', 'ignore_const_cols', 'score_each_iteration', 'offset_column', 'weights_column', 'family', 'tweedie_variance_power', 'tweedie_link_power', 'solver', 'alpha', 'lambda', 'lambda_search', 'early_stopping', 'nlambdas', 'standardize', 'missing_values_handling', 'compute_p_values', 'remove_collinear_columns', 'intercept', 'non_negative', 'max_iterations', 'objective_epsilon', 'beta_epsilon', 'gradient_epsilon', 'link', 'prior', 'lambda_min_ratio', 'beta_constraints', 'max_active_predictors', 'interactions', 'interaction_pairs', 'obj_reg', 'export_checkpoints_dir', 'balance_classes', 'class_sampling_factors', 'max_after_balance_size', 'max_confusion_matrix_size', 'max_hit_ratio_k', 'max_runtime_secs', 'custom_metric_func'])
In [12]: m.params['nlambdas']
Out[12]: {'default': -1, 'actual': 30}
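To collect only the values that differ from the defaults (which is usually what you need for retraining), something along these lines may help. This is a minimal sketch: non_default_params and SKIP are hypothetical names, and sample_params is an illustrative stand-in for m.params that reuses the nlambdas value shown above (the standardize entry is made up for the example).

```python
# Hypothetical helper, not part of the h2o API.
# It assumes `params` has the same shape as model.params:
# {name: {'default': ..., 'actual': ...}}
SKIP = {"model_id", "training_frame", "validation_frame", "response_column"}


def non_default_params(params, skip=SKIP):
    """Return {name: actual value} for parameters whose actual value differs from the default."""
    return {
        name: value["actual"]
        for name, value in params.items()
        if name not in skip and value["actual"] != value["default"]
    }


# Illustrative stand-in for m.params (no H2O cluster needed to run this)
sample_params = {
    "nlambdas": {"default": -1, "actual": 30},
    "standardize": {"default": True, "actual": True},  # made-up entry
}

changed = non_default_params(sample_params)
print(changed)  # {'nlambdas': 30}
```

With a live model you would call non_default_params(m.params) instead. The result could then serve as a starting point for the keyword arguments of the matching estimator, although some entries (for example, ones whose values are keys or frames) will likely need to be handled by hand.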
To further Erin LeDell's answer: the AutoML documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) recommends the BestOfFamily model ("The 'Best of Family' ensemble is optimized for production use since it only contains six (or fewer) base_models").
If you want to use that model, getting the hyperparameters of the base_models so that you can retrain on different data is a bit more involved:
Similar to the last answer, we can start by outputting the leaderboard:
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_runtime_secs=int(60*30), seed = 1)
aml.train(x=predictors, y=response, training_frame=df_h20)
lb = aml.leaderboard
lbdf = lb.as_data_frame()
lbdf.head()
yields:
AutoML progress: |████████████████████████████████████████████████████████| 100%
   model_id                                           mean_residual_deviance      rmse        mse       mae     rmsle
0  StackedEnsemble_BestOfFamily_AutoML_20190618_1...                6.960772  2.638328   6.960772  1.880983  0.049275
1  StackedEnsemble_AllModels_AutoML_20190618_145827                 6.960772  2.638328   6.960772  1.880983  0.049275
2  GBM_1_AutoML_20190618_145827                                     7.507970  2.740068   7.507970  1.934916  0.050984
3  DRF_1_AutoML_20190618_145827                                     7.781256  2.789490   7.781256  1.959508  0.051684
4  GLM_grid_1_AutoML_20190618_145827_model_1                        9.503375  3.082754   9.503375  2.273755  0.058174
5  GBM_2_AutoML_20190618_145827                                    18.464452  4.297028  18.464452  3.259346  0.079722
However, using m.params.keys() shows no way of getting the base_model hyperparameters:
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model(model_ids[0])
m.params['base_models']
returning:
{'default': [],
'actual': [{'__meta': {'schema_version': 3,
'schema_name': 'ModelKeyV3',
'schema_type': 'Key<Model>'},
'name': 'GBM_1_AutoML_20190618_145827',
'type': 'Key<Model>',
'URL': '/3/Models/GBM_1_AutoML_20190618_145827'},
{'__meta': {'schema_version': 3,
'schema_name': 'ModelKeyV3',
'schema_type': 'Key<Model>'},
'name': 'DRF_1_AutoML_20190618_145827',
'type': 'Key<Model>',
'URL': '/3/Models/DRF_1_AutoML_20190618_145827'},
{'__meta': {'schema_version': 3,
'schema_name': 'ModelKeyV3',
'schema_type': 'Key<Model>'},
'name': 'GLM_grid_1_AutoML_20190618_145827_model_1',
'type': 'Key<Model>',
'URL': '/3/Models/GLM_grid_1_AutoML_20190618_145827_model_1'}]}
You have to get a list of the URL of every base_model:
urllist = []
for model in m.params['base_models']['actual']:
    urllist.append(model['URL'])
print(urllist)
giving:
['/3/Models/GBM_1_AutoML_20190618_145827', '/3/Models/DRF_1_AutoML_20190618_145827', '/3/Models/GLM_grid_1_AutoML_20190618_145827_model_1']
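Each entry in base_models also carries the model's name, so the same loop can collect model ids instead of URLs. The sketch below assumes the entry layout shown in the output above; base_model_ids is a hypothetical helper name, and sample is a trimmed stand-in for m.params so the snippet runs without a cluster.

```python
def base_model_ids(stacked_params):
    """Return the model ids listed under the ensemble's 'base_models' parameter."""
    return [entry["name"] for entry in stacked_params["base_models"]["actual"]]


# Trimmed stand-in for m.params, copied from the output shown above
sample = {
    "base_models": {
        "default": [],
        "actual": [
            {"name": "GBM_1_AutoML_20190618_145827",
             "URL": "/3/Models/GBM_1_AutoML_20190618_145827"},
            {"name": "DRF_1_AutoML_20190618_145827",
             "URL": "/3/Models/DRF_1_AutoML_20190618_145827"},
            {"name": "GLM_grid_1_AutoML_20190618_145827_model_1",
             "URL": "/3/Models/GLM_grid_1_AutoML_20190618_145827_model_1"},
        ],
    }
}

ids = base_model_ids(sample)
print(ids)

# With a running cluster, each id could then be loaded with h2o.get_model(),
# e.g. base_models = [h2o.get_model(mid) for mid in base_model_ids(m.params)]
```

Loading the models this way gives you their .params dictionaries directly, which may be simpler than going through the REST endpoint if you are already working in a Python session.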
And then after that, you can see which hyperparameters are non-default by using the requests library:
import requests

for url in urllist:
    # Fetch the model description from the H2O REST API
    r = requests.get("http://localhost:54321" + url)
    model = r.json()
    print(url)
    for param in model['models'][0]['parameters']:
        # Skip ids and frames, which are specific to this training run
        if param['label'] in ['model_id', 'training_frame', 'validation_frame', 'response_column']:
            continue
        # Print only the parameters that differ from their defaults
        if param['default_value'] != param['actual_value']:
            print(param['label'])
            print(param['actual_value'])
    print(" ")
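If you would rather keep the non-default values than print them, the same filter can return a dictionary. This is only a sketch: changed_from_rest is a hypothetical name, it assumes the response layout used in the loop above (models[0]['parameters'] with label, default_value and actual_value), and the sample response contains made-up values purely for illustration.

```python
SKIP_LABELS = ("model_id", "training_frame", "validation_frame", "response_column")


def changed_from_rest(model_json, skip=SKIP_LABELS):
    """Return {label: actual_value} for parameters that differ from their defaults."""
    params = model_json["models"][0]["parameters"]
    return {
        p["label"]: p["actual_value"]
        for p in params
        if p["label"] not in skip and p["default_value"] != p["actual_value"]
    }


# Made-up response fragment, shaped like the JSON read in the loop above
sample_response = {
    "models": [{
        "parameters": [
            {"label": "model_id", "default_value": None,
             "actual_value": {"name": "GBM_1_AutoML_20190618_145827"}},
            {"label": "ntrees", "default_value": 50, "actual_value": 120},
            {"label": "max_depth", "default_value": 5, "actual_value": 5},
        ]
    }]
}

print(changed_from_rest(sample_response))  # {'ntrees': 120}
```

In the loop you would call changed_from_rest(r.json()) for each URL and store the result per model, for example in a dictionary keyed by the URL.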
Besides the above, you can connect to H2O Flow (the H2O server's web UI) at the local URL "http://localhost:54321" (or whichever port you are running on), click on the model you want, and inspect its parameters.