 

H2O R api: retrieving optimal model from grid search

Tags: python, r, h2o

I'm using the h2o package (v 3.6.0) in R, and I've built a grid search model. Now, I'm trying to access the model that minimizes MSE on the validation set. In Python's sklearn, this is easily achievable when using RandomizedSearchCV:

## Pseudo code:
grid = RandomizedSearchCV(model, params, n_iter = 5)
grid.fit(X, y)
best = grid.best_estimator_

This, unfortunately, does not prove as straightforward in h2o. Here's an example you can recreate:

library(h2o)
## assume you got h2o initialized...

X <- as.h2o(iris[1:100,]) # Note: only the first two classes (setosa, versicolor), for a binomial example
grid <- h2o.grid(
    algorithm = 'gbm',
    x = names(X[,1:4]),
    y = 'Species',
    training_frame = X,
    hyper_params = list(
        distribution = 'bernoulli',
        ntrees = c(25,50)
    )
)

Viewing grid prints a wealth of information, including this portion:

> grid
ntrees distribution status_ok                                                                 model_ids
 50    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
 25    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0

With a bit of digging, you can access each individual model and view every metric imaginable:

> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID:  Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1 
Model Summary: 
  number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1              50                4387         1         1    1.00000          2          2     2.00000


H2OBinomialMetrics: gbm
** Reported on training data. **

MSE:  1.056927e-05
R^2:  0.9999577
LogLoss:  0.003256338
AUC:  1
Gini:  1

Confusion Matrix for F1-optimal threshold:
           setosa versicolor    Error    Rate
setosa         50          0 0.000000   =0/50
versicolor      0         50 0.000000   =0/50
Totals         50         50 0.000000  =0/100

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  0.996749 1.000000   0
2                     max f2  0.996749 1.000000   0
3               max f0point5  0.996749 1.000000   0
4               max accuracy  0.996749 1.000000   0
5              max precision  0.996749 1.000000   0
6           max absolute_MCC  0.996749 1.000000   0
7 max min_per_class_accuracy  0.996749 1.000000   0

And with a lot of digging, you can finally get to this:

> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05

This seems like a lot of kludgey work to get down to a metric that ought to be top-level for model selection. In my situation, I've got a grid with hundreds of models, and my current, hacky solution just doesn't seem very "R-esque":

model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min_mse <- Inf
  best_model <- NULL

  for (model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if (mse < min_mse) {
      min_mse <- mse
      best_model <- model
    }
  }

  best_model
}
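The loop above boils down to a which.min over a vector of MSEs. Here is a base-R sketch of that selection step, using hypothetical MSE values (the model ids and numbers below are made up, standing in for what each h2o.getModel(id)@model$training_metrics@metrics$MSE call would return):

```r
# Hypothetical MSE values keyed by model id (stand-ins for the grid's models)
mse_by_id <- c(model_0 = 2.3e-4, model_1 = 1.06e-5, model_2 = 8.9e-4)

# which.min gives the position of the smallest MSE; names() recovers the id
best_id <- names(mse_by_id)[which.min(mse_by_id)]
```

In practice you could build such a vector with something like sapply(grid@model_ids, function(id) h2o.getModel(id)@model$training_metrics@metrics$MSE) and then fetch the winning model once, rather than holding a model object through every loop iteration.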

This seems like overkill for something that is so core to the practice of machine learning, and it just strikes me as odd that h2o would not have a "cleaner" method of extracting the optimal model, or at least model metrics.

Am I missing something? Is there no "out of the box" method for selecting the best model?

TayTay asked Feb 26 '16

2 Answers

Yes, there is an easy way to extract the "top" model of an H2O grid search. There are also utility functions that will extract all the model metrics (e.g. h2o.mse) that you have been trying to access. Examples of how to do these things can be found in the h2o-r/demos and h2o-py/demos subfolders on the h2o-3 GitHub repo.

Since you are using R, here is a relevant code example that includes a grid search with sorted results. You can also see how to access this information in the R documentation for the h2o.getGrid function.

Print out the AUC for all of the models, sorted by validation AUC:

auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)

Here is an example of the output:

H2O Grid Details
================

Grid ID: eeg_demo_gbm_grid 
Used hyper parameters: 
  -  ntrees 
  -  max_depth 
  -  learn_rate 
Number of models: 18 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   ntrees max_depth learn_rate                  model_ids               auc
1     100         5        0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2      50         5        0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3     100         5        0.1  eeg_demo_gbm_grid_model_8  0.94941792664595
4      50         5        0.1  eeg_demo_gbm_grid_model_7 0.922075196552274
5     100         3        0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6      50         3        0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7     100         3        0.1  eeg_demo_gbm_grid_model_5 0.884064379717198
8       5         5        0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9      50         3        0.1  eeg_demo_gbm_grid_model_4 0.848921799270639
10      5         5        0.1  eeg_demo_gbm_grid_model_6 0.825662907513139
11    100         2        0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12     50         2        0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13    100         2        0.1  eeg_demo_gbm_grid_model_2  0.78299280750123
14      5         3        0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15     50         2        0.1  eeg_demo_gbm_grid_model_1 0.754834657912535
16      5         3        0.1  eeg_demo_gbm_grid_model_3 0.749285131682721
17      5         2        0.2  eeg_demo_gbm_grid_model_9 0.692702793188135
18      5         2        0.1  eeg_demo_gbm_grid_model_0 0.676144542037133

The top row in the table contains the model with the best AUC, so below we can grab that model and extract the validation AUC:

best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)

In order for the h2o.getGrid function to be able to sort by a metric on the validation set, you need to actually pass the h2o.grid function a validation_frame. In your example above, you did not pass a validation_frame, so you can't evaluate the models in the grid on the validation set.

Erin LeDell answered Nov 15 '22


This only seems to be valid for recent versions of h2o; with 3.8.2.3 you get a Java exception saying that "auc" is an invalid metric. The following fails:

library(h2o)
library(jsonlite)
h2o.init()
iris.hex <- as.h2o(iris)
h2o.grid("gbm", grid_id = "gbm_grid_id", x = c(1:4), y = 5,
     training_frame = iris.hex, hyper_params = list(ntrees = c(1,2,3)))
grid <- h2o.getGrid("gbm_grid_id", sort_by = "auc", decreasing = T)

However, replace 'auc' with 'logloss' and set decreasing = FALSE (logloss is minimized, so the best model sorts first), and it works fine:

grid <- h2o.getGrid("gbm_grid_id", sort_by = "logloss", decreasing = FALSE)

RobertoRonaldo answered Nov 15 '22