
How to find best params of leader model in automl h2o python

I trained H2O AutoML and got a leader model with satisfactory metrics. I want to retrain the model periodically, but without using a checkpoint. So I guess all I need are the best parameters of the leader model, so that I can run it manually. I know about automlmodels.leader.params, but it gives a list of all parameters tried. How can I get the best ones, as found in the leaderboard?

Georgios Kourogiorgas asked Mar 09 '19 19:03


3 Answers

Here's a solution using the example from the H2O AutoML User Guide. The parameters for any model are stored in its params attribute (model.params). To get the parameters of the leader model, use aml.leader.params. For any other model, load it into a Python object with the h2o.get_model() function and access its parameters the same way, using .params.

The .params object is a dictionary which stores all the parameter values (default and actual).

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

The top of the leaderboard looks like this:

In [3]: aml.leaderboard
Out[3]:
model_id                                                  auc    logloss    mean_per_class_error      rmse       mse
---------------------------------------------------  --------  ---------  ----------------------  --------  --------
StackedEnsemble_AllModels_AutoML_20190309_152507     0.788879   0.552328                0.315963  0.432607  0.187149
StackedEnsemble_BestOfFamily_AutoML_20190309_152507  0.787642   0.553538                0.317995  0.433144  0.187614
XGBoost_1_AutoML_20190309_152507                     0.785199   0.557134                0.327844  0.434681  0.188948
XGBoost_grid_1_AutoML_20190309_152507_model_4        0.783523   0.557854                0.318819  0.435249  0.189441
XGBoost_grid_1_AutoML_20190309_152507_model_3        0.783004   0.559613                0.325081  0.435708  0.189841
XGBoost_2_AutoML_20190309_152507                     0.782186   0.558342                0.335769  0.435571  0.189722
XGBoost_3_AutoML_20190309_152507                     0.7815     0.55952                 0.319151  0.436034  0.190126
GBM_5_AutoML_20190309_152507                         0.780837   0.559903                0.340848  0.436191  0.190263
GBM_2_AutoML_20190309_152507                         0.780036   0.559806                0.339926  0.436415  0.190458
GBM_1_AutoML_20190309_152507                         0.779827   0.560857                0.335096  0.436616  0.190633

[22 rows x 6 columns]

Here the leader is a Stacked Ensemble. We can look at the parameter names like this:

In [6]: aml.leader.params.keys()
Out[6]: dict_keys(['model_id', 'training_frame', 'response_column', 'validation_frame', 'base_models', 'metalearner_algorithm', 'metalearner_nfolds', 'metalearner_fold_assignment', 'metalearner_fold_column', 'keep_levelone_frame', 'metalearner_params', 'seed', 'export_checkpoints_dir'])
In [7]: aml.leader.params['metalearner_algorithm']
Out[7]: {'default': 'AUTO', 'actual': 'AUTO'}
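
Because the leader here is a Stacked Ensemble, its params describe the ensemble rather than a single tuned algorithm. If what you want is the highest-ranked individual model, one option is to take the first leaderboard id that is not a stacked ensemble. This is only a minimal sketch: the helper name first_non_ensemble is hypothetical, it assumes the ids are ordered best first, and the sample ids are copied from the leaderboard above.

```python
def first_non_ensemble(model_ids):
    """Return the first id that is not a Stacked Ensemble, or None.

    Assumes model_ids is ordered best-first, as in the AutoML leaderboard.
    """
    for mid in model_ids:
        if not mid.startswith("StackedEnsemble"):
            return mid
    return None


# Ids copied from the top of the leaderboard shown above
leaderboard_ids = [
    "StackedEnsemble_AllModels_AutoML_20190309_152507",
    "StackedEnsemble_BestOfFamily_AutoML_20190309_152507",
    "XGBoost_1_AutoML_20190309_152507",
    "XGBoost_grid_1_AutoML_20190309_152507_model_4",
]

best_single = first_non_ensemble(leaderboard_ids)
print(best_single)
```

With a live cluster, the resulting id could be passed to h2o.get_model() to inspect that model's .params.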

If you are interested in a particular model, such as the GLM, you can grab it like this and examine its hyperparameter values.

# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the GLM model
m = h2o.get_model([mid for mid in model_ids if "GLM" in mid][0])  

Now look at the parameter names and then check out the default and actual values:

In [11]: m.params.keys()
Out[11]: dict_keys(['model_id', 'training_frame', 'validation_frame', 'nfolds', 'seed', 'keep_cross_validation_models', 'keep_cross_validation_predictions', 'keep_cross_validation_fold_assignment', 'fold_assignment', 'fold_column', 'response_column', 'ignored_columns', 'ignore_const_cols', 'score_each_iteration', 'offset_column', 'weights_column', 'family', 'tweedie_variance_power', 'tweedie_link_power', 'solver', 'alpha', 'lambda', 'lambda_search', 'early_stopping', 'nlambdas', 'standardize', 'missing_values_handling', 'compute_p_values', 'remove_collinear_columns', 'intercept', 'non_negative', 'max_iterations', 'objective_epsilon', 'beta_epsilon', 'gradient_epsilon', 'link', 'prior', 'lambda_min_ratio', 'beta_constraints', 'max_active_predictors', 'interactions', 'interaction_pairs', 'obj_reg', 'export_checkpoints_dir', 'balance_classes', 'class_sampling_factors', 'max_after_balance_size', 'max_confusion_matrix_size', 'max_hit_ratio_k', 'max_runtime_secs', 'custom_metric_func'])

In [12]: m.params['nlambdas']
Out[12]: {'default': -1, 'actual': 30}
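
Since each entry in .params holds both a default and an actual value, one way to get a compact set of settings for retraining is to keep only the entries where the two differ. The following is a minimal sketch, not an H2O API: the helper name non_default_params and the set of skipped keys are assumptions, and the sample dictionary only mimics the structure shown above.

```python
# Keys that identify a particular run rather than tune the algorithm (an assumption)
RUN_SPECIFIC = {"model_id", "training_frame", "validation_frame", "response_column"}


def non_default_params(params):
    """Map parameter name -> actual value, for entries that differ from the default."""
    return {
        name: entry["actual"]
        for name, entry in params.items()
        if name not in RUN_SPECIFIC and entry["actual"] != entry["default"]
    }


# Sample data in the same shape as m.params (values are illustrative)
sample_params = {
    "model_id": {"default": None, "actual": "GLM_example"},
    "nlambdas": {"default": -1, "actual": 30},
    "standardize": {"default": True, "actual": True},
}

print(non_default_params(sample_params))
```

On a live model this would be called as non_default_params(m.params). The result is a starting point for constructing a new estimator by hand; some values may need adjusting before they can be passed as constructor arguments.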

Erin LeDell answered Oct 21 '22 07:10


To extend Erin LeDell's answer: you may want to use the BestOfFamily model, as recommended by the AutoML documentation ("The 'Best of Family' ensemble is optimized for production use since it only contains six (or fewer) base_models"):

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

In that case, getting the hyperparameters of the base_models so that you can retrain on different data is a bit more involved.

As in the previous answer, we can start by outputting the leaderboard:

from h2o.automl import H2OAutoML

# predictors, response and df_h20 are assumed to be defined already
aml = H2OAutoML(max_runtime_secs=int(60*30), seed=1)
aml.train(x=predictors, y=response, training_frame=df_h20)
lb = aml.leaderboard
lbdf = lb.as_data_frame()
lbdf.head()

yields:

AutoML progress: |████████████████████████████████████████████████████████| 100%

   model_id                                           mean_residual_deviance  rmse      mse        mae       rmsle
0  StackedEnsemble_BestOfFamily_AutoML_20190618_1...  6.960772                2.638328  6.960772   1.880983  0.049275
1  StackedEnsemble_AllModels_AutoML_20190618_145827   6.960772                2.638328  6.960772   1.880983  0.049275
2  GBM_1_AutoML_20190618_145827                       7.507970                2.740068  7.507970   1.934916  0.050984
3  DRF_1_AutoML_20190618_145827                       7.781256                2.789490  7.781256   1.959508  0.051684
4  GLM_grid_1_AutoML_20190618_145827_model_1          9.503375                3.082754  9.503375   2.273755  0.058174
5  GBM_2_AutoML_20190618_145827                       18.464452               4.297028  18.464452  3.259346  0.079722

However, the ensemble's m.params gives no direct way to get the base_model hyperparameters; its base_models entry holds only references to the base models:

model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model(model_ids[0])
m.params['base_models']

returning:

{'default': [],
 'actual': [{'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_1_AutoML_20190618_145827',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_1_AutoML_20190618_145827'},
  {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'DRF_1_AutoML_20190618_145827',
   'type': 'Key<Model>',
   'URL': '/3/Models/DRF_1_AutoML_20190618_145827'},
  {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GLM_grid_1_AutoML_20190618_145827_model_1',
   'type': 'Key<Model>',
   'URL': '/3/Models/GLM_grid_1_AutoML_20190618_145827_model_1'}]}

You then need to collect the URL of every base_model:

urllist = []
for model in m.params['base_models']['actual']:
    urllist.append(model['URL'])

print(urllist)

giving:

['/3/Models/GBM_1_AutoML_20190618_145827', '/3/Models/DRF_1_AutoML_20190618_145827', '/3/Models/GLM_grid_1_AutoML_20190618_145827_model_1']

And then after that, you can see which hyperparameters are non-default by using the requests library:

import requests

# Assumes the H2O server is running locally on port 54321
skip = {'model_id', 'training_frame', 'validation_frame', 'response_column'}

for url in urllist:
    r = requests.get("http://localhost:54321" + url)
    model = r.json()
    print(url)

    for param in model['models'][0]['parameters']:
        # Skip parameters that only identify this particular run
        if param['label'] in skip:
            continue

        # Print only parameters whose actual value differs from the default
        if param['default_value'] != param['actual_value']:
            print(param['label'])
            print(param['actual_value'])
            print(" ")
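
As a possible alternative to issuing HTTP requests, each entry in base_models also carries the model's name, which is the same kind of id that h2o.get_model() accepts in the first answer. A minimal sketch, assuming the structure printed above; base_model_names is a hypothetical helper and the sample data is abbreviated from that output.

```python
def base_model_names(params):
    """Extract base model ids from a Stacked Ensemble's params dictionary."""
    return [entry["name"] for entry in params["base_models"]["actual"]]


# Abbreviated copy of the m.params['base_models'] output shown above
ensemble_params = {
    "base_models": {
        "default": [],
        "actual": [
            {"name": "GBM_1_AutoML_20190618_145827", "type": "Key<Model>"},
            {"name": "DRF_1_AutoML_20190618_145827", "type": "Key<Model>"},
            {"name": "GLM_grid_1_AutoML_20190618_145827_model_1", "type": "Key<Model>"},
        ],
    }
}

for name in base_model_names(ensemble_params):
    print(name)
```

With a running cluster, each name could then be passed to h2o.get_model(name) and its .params inspected as in the first answer.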

David Jacques answered Oct 21 '22 08:10


Besides the above, you can open H2O Flow, the server's web interface, at the local URL "http://localhost:54321" (or whichever port your instance is running on), click on the model you want, and inspect its parameters.

Stavros Limberopoulos answered Oct 21 '22 08:10