I am trying to build a stacked ensemble model to predict merchant churn using R (version 3.3.3) and deep learning in h2o (version 3.10.5.1). The response variable is binary. At the moment I am trying to run the code to build a stacked ensemble model using the top 5 models developed by the grid search. However, when the code is run, I get a java.lang.NullPointerException with the following output:
java.lang.NullPointerException
at hex.StackedEnsembleModel.checkAndInheritModelProperties(StackedEnsembleModel.java:265)
at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(StackedEnsemble.java:115)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Below is the code that I've used to do the hyper-parameter grid search and build the ensemble model:
hyper_params <- list(
  activation = c("Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"),
  hidden = list(c(50,50), c(30,30,30), c(32,32,32,32,32), c(64,64,64,64,64), c(100,100,100,100,100)),
  input_dropout_ratio = seq(0, 0.2, 0.05),
  l1 = seq(0, 1e-4, 1e-6),
  l2 = seq(0, 1e-4, 1e-6),
  rho = c(0.9, 0.95, 0.99, 0.999),
  epsilon = c(1e-10, 1e-09, 1e-08, 1e-07, 1e-06, 1e-05, 1e-04)
)
search_criteria <- list(
  strategy = "RandomDiscrete",
  max_runtime_secs = 3600,
  max_models = 100,
  seed = 1234,
  stopping_metric = "misclassification",
  stopping_tolerance = 0.01,
  stopping_rounds = 5
)
dl_ensemble_grid <- h2o.grid(
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  algorithm = "deeplearning",
  grid_id = "final_grid_ensemble_dl",
  x = predictors,
  y = response,
  training_frame = h2o.rbind(train, valid, test),
  nfolds = 5,
  fold_assignment = "Modulo",
  keep_cross_validation_predictions = TRUE,
  keep_cross_validation_fold_assignment = TRUE,
  epochs = 12,
  max_runtime_secs = 3600,
  stopping_metric = "misclassification",
  stopping_tolerance = 0.01,
  stopping_rounds = 5,
  seed = 1234,
  max_w2 = 10
)
DLsortedGridEnsemble_logloss <- h2o.getGrid("final_grid_ensemble_dl",sort_by="logloss",decreasing=FALSE)
ensemble <- h2o.stackedEnsemble(x = predictors,
                                y = response,
                                training_frame = h2o.rbind(train, valid, test),
                                base_models = list(
                                  DLsortedGridEnsemble_logloss@model_ids[[1]],
                                  DLsortedGridEnsemble_logloss@model_ids[[2]],
                                  DLsortedGridEnsemble_logloss@model_ids[[3]],
                                  DLsortedGridEnsemble_logloss@model_ids[[4]],
                                  DLsortedGridEnsemble_logloss@model_ids[[5]]
                                ))
Note: what I have realised so far is that the h2o.stackedEnsemble() function works when there is only one base model, and it throws the Java error as soon as there are two or more base models.
I would really appreciate some feedback on how this could be resolved.
The error refers to the line of the StackedEnsembleModel.java code that checks that the training_frame used for the base models and the training_frame passed to h2o.stackedEnsemble() have the same checksum. I think the problem is caused by creating the training frame dynamically rather than defining it explicitly (even though that should work, since it's the same data in the end). So, rather than setting training_frame = h2o.rbind(train, valid, test) in the grid and ensemble functions, set the following at the top of your code:
df <- h2o.rbind(train, valid, test)
And then set training_frame = df in the grid and ensemble functions.
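For illustration, here is a minimal sketch of the reworked calls; it assumes the predictors, response, train, valid, and test objects from the question already exist, the hyper_params and search_criteria lists are defined as above, and h2o has been initialised:

# Build the combined frame once, so the base models and the ensemble
# reference the exact same H2OFrame (and therefore the same checksum).
df <- h2o.rbind(train, valid, test)

dl_ensemble_grid <- h2o.grid(
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  algorithm = "deeplearning",
  grid_id = "final_grid_ensemble_dl",
  x = predictors,
  y = response,
  training_frame = df,                        # shared frame, not a fresh h2o.rbind()
  nfolds = 5,
  fold_assignment = "Modulo",
  keep_cross_validation_predictions = TRUE,   # required for stacking
  seed = 1234
)

DLsortedGridEnsemble_logloss <- h2o.getGrid("final_grid_ensemble_dl",
                                            sort_by = "logloss",
                                            decreasing = FALSE)

ensemble <- h2o.stackedEnsemble(
  x = predictors,
  y = response,
  training_frame = df,                        # same frame object as the grid
  base_models = list(
    DLsortedGridEnsemble_logloss@model_ids[[1]],
    DLsortedGridEnsemble_logloss@model_ids[[2]],
    DLsortedGridEnsemble_logloss@model_ids[[3]],
    DLsortedGridEnsemble_logloss@model_ids[[4]],
    DLsortedGridEnsemble_logloss@model_ids[[5]]
  )
)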
As a side note, you may get better DL models if you use a validation frame (for early stopping) rather than using all your data for the training frame. Also, if you want to use all the models in your grid (which might lead to better performance, but not always), you can set base_models = DLsortedGridEnsemble_logloss@model_ids in the h2o.stackedEnsemble() function, as sketched below.
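A rough sketch of that variant, assuming the same splits and hyper-parameter objects as above; the new grid_id and object names are just placeholders so the sketch doesn't clash with the earlier grid:

# Variant: train the base models on train only, use valid for early stopping,
# and stack every model in the grid rather than just the top five.
dl_grid_v2 <- h2o.grid(
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  algorithm = "deeplearning",
  grid_id = "final_grid_ensemble_dl_v2",      # placeholder grid id
  x = predictors,
  y = response,
  training_frame = train,
  validation_frame = valid,                   # scored for early stopping
  stopping_metric = "misclassification",
  stopping_tolerance = 0.01,
  stopping_rounds = 5,
  nfolds = 5,
  fold_assignment = "Modulo",
  keep_cross_validation_predictions = TRUE,
  seed = 1234
)

grid_sorted <- h2o.getGrid("final_grid_ensemble_dl_v2", sort_by = "logloss", decreasing = FALSE)

ensemble_all <- h2o.stackedEnsemble(
  x = predictors,
  y = response,
  training_frame = train,                     # same frame the base models were trained on
  base_models = grid_sorted@model_ids         # all models in the grid
)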