I'm trying to explore the use of a GBM with h2o
for a classification issue to replace a logistic regression (GLM). The non-linearity and interactions in my data make me think a GBM is more suitable.
I've run a baseline GBM (see below) and compared its AUC against the AUC of the logistic regression. The GBM performs much better.
In a classic linear logistic regression, one would be able to see the direction and effect of each of the predictors (x) on the outcome variable (y).
Now, I would like to evaluate the variable importance of the estimated GBM in the same way.
How does one obtain the variable importance for each of the (two) classes?
I know that the variable importance is not the same as the estimated coefficient in a logistic regression, but it would help me to understand which predictor impacts what class.
Others have asked similar questions, but the answers provided won't work for the H2O object.
Any help is much appreciated.
example.gbm <- h2o.gbm(x = c("list of predictors"),
                       y = "binary response variable",
                       training_frame = data,
                       max_runtime_secs = 1800,
                       nfolds = 5,
                       stopping_metric = "AUC")
Variable Importance Calculation (GBM & DRF): Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected to split on during the tree-building process, and how much the squared error (over all trees) improved (decreased) as a result.
This approach, which we call TREEWEIGHT, calculates the feature importance score for a variable by summing the impurity reductions over all nodes in the tree where a split was made on that variable, with impurity reductions weighted to account for the size of the node.
(My) definition: Variable importance refers to how much a given model "uses" that variable to make accurate predictions. The more a model relies on a variable to make predictions, the more important it is for the model.
The variable importance plot provides a list of the most significant variables in descending order by mean decrease in Gini. The top variables contribute more to the model than the bottom ones and also have higher predictive power in classifying default and non-default customers.
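In H2O these importances can be read straight off the fitted model with h2o.varimp() and h2o.varimp_plot(). A minimal sketch, assuming a model like the example.gbm fitted in the question (note these are overall importances, not per-class ones):

```r
library(h2o)

# Table of per-variable importances: relative_importance,
# scaled_importance and percentage, most important first.
vi <- h2o.varimp(example.gbm)
print(vi)

# Bar chart of the same table, top variable first.
h2o.varimp_plot(example.gbm)
```

For a binomial GBM this table is a single ranking over both classes; to see which class a variable pushes an observation towards, use the partial-dependence and LIME approaches below.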
AFAICS, the more powerful a machine learning method is, the harder it is to explain what's going on beneath it.
The advantages of the GBM method (as you mentioned already) also make the model harder to understand. This is especially true for numeric variables, where a GBM model may use different value ranges differently: some ranges may have a positive impact whereas others have a negative effect.
For a GLM with no interactions specified, a numeric variable's effect is monotonic, so you can examine whether its impact is positive or negative.
Since such a total view is difficult, is there any way to analyse the model? There are two methods we can start with:
h2o provides h2o.partialPlot, which gives the partial (i.e. marginal) effect of each variable, which can be read as its effect on the response:
library(h2o)
h2o.init()

prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prostate.path, destination_frame = "prostate.hex")
prostate.hex[, "CAPSULE"] <- as.factor(prostate.hex[, "CAPSULE"])
prostate.hex[, "RACE"] <- as.factor(prostate.hex[, "RACE"])

prostate.gbm <- h2o.gbm(x = c("AGE", "RACE"),
                        y = "CAPSULE",
                        training_frame = prostate.hex,
                        ntrees = 10,
                        max_depth = 5,
                        learn_rate = 0.1)

h2o.partialPlot(object = prostate.gbm, data = prostate.hex, cols = "AGE")
The LIME package [https://github.com/thomasp85/lime] provides the capability to check each variable's contribution for each individual observation. Luckily, this R package already supports h2o.
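A minimal sketch of the LIME workflow, assuming the prostate.gbm model and prostate.hex frame from the partial-plot example above (the choice of rows and n_features is illustrative only):

```r
library(lime)

# lime() wants a plain data.frame of the predictors used in training.
train.df <- as.data.frame(prostate.hex)[, c("AGE", "RACE")]

# Build an explainer from the training data and the h2o model.
explainer <- lime(train.df, prostate.gbm)

# Explain a few individual observations: each row gets per-variable
# weights showing which class each variable pushes the prediction
# towards, and by how much.
explanation <- explain(train.df[1:4, ], explainer,
                       n_labels = 1, n_features = 2)
explanation

# Visualise the per-observation contributions.
plot_features(explanation)
```

This gives you exactly the per-class, per-predictor direction the question asks about, but locally (per observation) rather than as a single global coefficient.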