I'm trying to explore the use of a GBM with h2o
for a classification issue to replace a logistic regression (GLM). The non-linearity and interactions in my data make me think a GBM is more suitable.
I've run a baseline GBM (see below) and compared its AUC against the AUC of the logistic regression. The GBM performs much better.
In a classic linear logistic regression, one would be able to see the direction and effect of each of the predictors (x) on the outcome variable (y).
Now, I would like to evaluate the variable importance of the estimated GBM in the same way.
How does one obtain the variable importance for each of the (two) classes?
I know that the variable importance is not the same as the estimated coefficient in a logistic regression, but it would help me to understand which predictor impacts what class.
Others have asked similar questions, but the answers provided won't work for the H2O object.
Any help is much appreciated.
example.gbm <- h2o.gbm(x = c("list of predictors"),
                       y = "binary response variable",
                       training_frame = data,
                       max_runtime_secs = 1800,
                       nfolds = 5,
                       stopping_metric = "AUC")
Variable Importance Calculation (GBM & DRF): Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected to split on during the tree-building process, and how much the squared error (over all trees) improved (decreased) as a result.
This approach, which we call TREEWEIGHT, calculates the feature importance score for a variable by summing the impurity reductions over all nodes in the tree where a split was made on that variable, with impurity reductions weighted to account for the size of the node.
(My) definition: Variable importance refers to how much a given model "uses" that variable to make accurate predictions. The more a model relies on a variable to make predictions, the more important it is for the model.
The variable importance plot provides a list of the most significant variables in descending order by mean decrease in Gini. The top variables contribute more to the model than the bottom ones and also have higher predictive power in classifying default and non-default customers.
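In H2O these importances can be read straight off the fitted model with h2o.varimp() and h2o.varimp_plot(). A minimal sketch, assuming a model like the example.gbm fitted in the question (note these are overall importances, not per-class ones):

```r
library(h2o)

# Table of per-variable importances: relative_importance,
# scaled_importance and percentage, most important first.
vi <- h2o.varimp(example.gbm)
print(vi)

# Bar chart of the same table, top variable first.
h2o.varimp_plot(example.gbm)
```

For a binomial GBM this table is a single ranking over both classes; to see which class a variable pushes an observation towards, use the partial-dependence and LIME approaches below.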
AFAICS, the more powerful a machine learning method is, the harder it is to explain what's going on beneath it.
The advantages of the GBM method (as you mentioned already) also make the model harder to understand. This is especially true for numeric variables, where a GBM model may use different value ranges differently: some ranges may have a positive impact whereas others have a negative effect.
For a GLM with no interactions specified, a numeric variable's effect is monotonic, so you can examine whether its impact is positive or negative.
Since such a total view is difficult, is there any way to analyse the model? There are two methods we can start with:
h2o provides h2o.partialPlot, which gives the partial (i.e. marginal) effect of each variable, which can be read as its effect on the response:
library(h2o)
h2o.init()

prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prostate.path, destination_frame = "prostate.hex")
prostate.hex[, "CAPSULE"] <- as.factor(prostate.hex[, "CAPSULE"])
prostate.hex[, "RACE"] <- as.factor(prostate.hex[, "RACE"])

prostate.gbm <- h2o.gbm(x = c("AGE", "RACE"),
                        y = "CAPSULE",
                        training_frame = prostate.hex,
                        ntrees = 10,
                        max_depth = 5,
                        learn_rate = 0.1)

h2o.partialPlot(object = prostate.gbm, data = prostate.hex, cols = "AGE")
The LIME package [https://github.com/thomasp85/lime] provides the capability to check each variable's contribution for each individual observation. Luckily, this R package already supports h2o.
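A minimal sketch of the LIME workflow, assuming the prostate.gbm model and prostate.hex frame from the partial-plot example above (the choice of rows and n_features is illustrative only):

```r
library(lime)

# lime() wants a plain data.frame of the predictors used in training.
train.df <- as.data.frame(prostate.hex)[, c("AGE", "RACE")]

# Build an explainer from the training data and the h2o model.
explainer <- lime(train.df, prostate.gbm)

# Explain a few individual observations: each row gets per-variable
# weights showing which class each variable pushes the prediction
# towards, and by how much.
explanation <- explain(train.df[1:4, ], explainer,
                       n_labels = 1, n_features = 2)
explanation

# Visualise the per-observation contributions.
plot_features(explanation)
```

This gives you exactly the per-class, per-predictor direction the question asks about, but locally (per observation) rather than as a single global coefficient.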