I'm using the h2o package (v 3.6.0) in R, and I've built a grid search model. Now I'm trying to access the model that minimizes MSE on the validation set. In Python's sklearn, this is easily achievable with RandomizedSearchCV:
## Pseudo code:
grid = RandomizedSearchCV(model, params, n_iter=5)
grid.fit(X, y)
best = grid.best_estimator_
This, unfortunately, does not prove as straightforward in h2o. Here's an example you can recreate:
library(h2o)
## assume you got h2o initialized...
X <- as.h2o(iris[1:100,]) # Note: only using top two classes for example
grid <- h2o.grid(
  algorithm = 'gbm',
  x = names(X[,1:4]),
  y = 'Species',
  training_frame = X,
  hyper_params = list(
    distribution = 'bernoulli',
    ntrees = c(25,50)
  )
)
Viewing grid prints a wealth of information, including this portion:
> grid
ntrees distribution status_ok model_ids
50 bernoulli OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
25 bernoulli OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0
With a bit of digging, you can access each individual model and view every metric imaginable:
> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID: Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 50 4387 1 1 1.00000 2 2 2.00000
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 1.056927e-05
R^2: 0.9999577
LogLoss: 0.003256338
AUC: 1
Gini: 1
Confusion Matrix for F1-optimal threshold:
setosa versicolor Error Rate
setosa 50 0 0.000000 =0/50
versicolor 0 50 0.000000 =0/50
Totals 50 50 0.000000 =0/100
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.996749 1.000000 0
2 max f2 0.996749 1.000000 0
3 max f0point5 0.996749 1.000000 0
4 max accuracy 0.996749 1.000000 0
5 max precision 0.996749 1.000000 0
6 max absolute_MCC 0.996749 1.000000 0
7 max min_per_class_accuracy 0.996749 1.000000 0
And with a lot of digging, you can finally get to this:
> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05
This seems like a lot of kludgey work to get down to a metric that ought to be top-level for model selection. In my situation, I've got a grid with hundreds of models, and my current, hacky solution just doesn't seem very "R-esque":
model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min <- Inf
  best_model <- NULL
  for(model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if(mse < min) {
      min <- mse
      best_model <- model
    }
  }
  best_model
}
This seems like overkill for something that is so core to the practice of machine learning, and it just strikes me as odd that h2o would not have a "cleaner" method of extracting the optimal model, or at least model metrics.
Am I missing something? Is there no "out of the box" method for selecting the best model?
Yes, there is an easy way to extract the "top" model of an H2O grid search. There are also utility functions (e.g. h2o.mse) that will extract all the model metrics you have been trying to access. Examples of how to do these things can be found in the h2o-r/demos and h2o-py/demos subfolders on the h2o-3 GitHub repo.
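As a rough illustration of how a metric accessor such as h2o.mse can replace the manual loop from the question, the selection can be written as a small vectorized helper. The name select_best and its fetch and metric arguments are hypothetical and not part of h2o; the stand-in "models" are made up so that the sketch runs without a cluster.

```r
# Hypothetical helper (not part of h2o): return the model with the lowest score.
#   ids    - list or vector of model ids, e.g. grid@model_ids
#   fetch  - function mapping an id to a model, e.g. h2o.getModel
#   metric - function mapping a model to a single number, e.g. h2o.mse
select_best <- function(ids, fetch, metric) {
  models <- lapply(ids, fetch)
  scores <- vapply(models, metric, numeric(1))
  models[[which.min(scores)]]
}

# Stand-in "models" (invented values) so the sketch runs without an H2O cluster.
fake_models <- list(
  m0 = list(id = "m0", mse = 0.30),
  m1 = list(id = "m1", mse = 0.10),
  m2 = list(id = "m2", mse = 0.20)
)

best <- select_best(
  ids    = names(fake_models),
  fetch  = function(id) fake_models[[id]],
  metric = function(m) m$mse
)
print(best$id)
```

With a real grid, the call would presumably look like select_best(grid@model_ids, h2o.getModel, function(m) h2o.mse(m, valid = TRUE)), assuming the grid was trained with a validation_frame.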
Since you are using R, here is a relevant code example that includes a grid search with sorted results. You can also find how to access this information in the R documentation for the h2o.getGrid function.
Print out the AUC for all of the models, sorted by validation AUC:
auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)
Here is an example of the output:
H2O Grid Details
================
Grid ID: eeg_demo_gbm_grid
Used hyper parameters:
- ntrees
- max_depth
- learn_rate
Number of models: 18
Number of failed models: 0
Hyper-Parameter Search Summary: ordered by decreasing auc
ntrees max_depth learn_rate model_ids auc
1 100 5 0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2 50 5 0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3 100 5 0.1 eeg_demo_gbm_grid_model_8 0.94941792664595
4 50 5 0.1 eeg_demo_gbm_grid_model_7 0.922075196552274
5 100 3 0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6 50 3 0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7 100 3 0.1 eeg_demo_gbm_grid_model_5 0.884064379717198
8 5 5 0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9 50 3 0.1 eeg_demo_gbm_grid_model_4 0.848921799270639
10 5 5 0.1 eeg_demo_gbm_grid_model_6 0.825662907513139
11 100 2 0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12 50 2 0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13 100 2 0.1 eeg_demo_gbm_grid_model_2 0.78299280750123
14 5 3 0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15 50 2 0.1 eeg_demo_gbm_grid_model_1 0.754834657912535
16 5 3 0.1 eeg_demo_gbm_grid_model_3 0.749285131682721
17 5 2 0.2 eeg_demo_gbm_grid_model_9 0.692702793188135
18 5 2 0.1 eeg_demo_gbm_grid_model_0 0.676144542037133
The top row in the table contains the model with the best AUC, so below we can grab that model and extract the validation AUC:
best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)
In order for the h2o.getGrid function to be able to sort by a metric on the validation set, you need to actually pass the h2o.grid function a validation_frame. In your example above, you did not pass a validation_frame, so you can't evaluate the models in the grid on the validation set.
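For the iris example from the question, that might look roughly like the sketch below. It assumes a running H2O cluster and an h2o release in which h2o.getGrid accepts sort_by; the 80/20 split, the seed, and the grid_id "iris_gbm_grid" are arbitrary choices made for illustration.

```r
library(h2o)
h2o.init()

# Two-class subset of iris; droplevels() removes the unused third level.
two_class <- droplevels(iris[1:100, ])
X <- as.h2o(two_class)

# Hold out part of the data for validation (the 80/20 split is an arbitrary choice).
splits <- h2o.splitFrame(X, ratios = 0.8, seed = 1)
train <- splits[[1]]
valid <- splits[[2]]

grid <- h2o.grid(
  algorithm        = "gbm",
  grid_id          = "iris_gbm_grid",  # illustrative id, not from the original post
  x                = names(iris)[1:4],
  y                = "Species",
  training_frame   = train,
  validation_frame = valid,
  distribution     = "bernoulli",
  hyper_params     = list(ntrees = c(25, 50))
)

# Sort ascending so the model with the lowest MSE comes first.
mse_table <- h2o.getGrid(grid_id = "iris_gbm_grid", sort_by = "mse", decreasing = FALSE)
best_model <- h2o.getModel(mse_table@model_ids[[1]])

# Validation MSE of the selected model.
h2o.mse(best_model, valid = TRUE)
```

Because MSE is an error metric, decreasing = FALSE puts the best model in the first row, the opposite of the AUC example above.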
This seems to be valid only for recent versions of h2o. With 3.8.2.3 you get a Java exception saying that "auc" is an invalid metric. The following fails:
library(h2o)
library(jsonlite)
h2o.init()
iris.hex <- as.h2o(iris)
h2o.grid("gbm", grid_id = "gbm_grid_id", x = c(1:4), y = 5,
         training_frame = iris.hex, hyper_params = list(ntrees = c(1,2,3)))
grid <- h2o.getGrid("gbm_grid_id", sort_by = "auc", decreasing = TRUE)
However, replace 'auc' with 'logloss' and set decreasing = FALSE, and it works fine.