How does the R implementation of boosted regression trees (package gbm) by default deal with missing values of the predictor variables? Are they imputed and if they are, according to which algorithm?
Background of my question: I did the analysis almost a year ago, using the scripts provided by Elith et al. 2008 (A working guide to boosted regression trees, Journal of Animal Ecology 77, 802–813) to invoke gbm. I have now become aware that I had NAs for some of the predictor variables, and I wonder how the boosted regression trees dealt with them. Browsing through various manuals and papers I found statements like "boosted regression trees can accommodate missing values" and the like, but I couldn't find a precise description of what gbm does with missing values. The analysis itself ran without problems, so gbm must have dealt with them in one way or another. The gbm manual even contains an example where NAs are deliberately introduced to demonstrate that gbm keeps working without problems. Now I'd like to know precisely what gbm does with NAs (skip them, impute them, ...?).
Decision trees can handle missing values automatically, and they are usually robust to outliers as well.
Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance).
Boosted regression trees are a powerful algorithm that works well with large datasets, or when the number of environmental variables is large relative to the number of observations, and they are very robust to missing values and outliers.
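To see this robustness in practice, here is a minimal sketch along the lines of the example mentioned from the gbm manual (not that example itself; all variable names and tuning parameters here are illustrative): NAs are deliberately introduced into a predictor, yet gbm fits and predicts without any user-side imputation.

```r
library(gbm)

set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 * x1 + rnorm(n)
x1[sample(n, 50)] <- NA          # deliberately introduce missing predictor values

d   <- data.frame(y, x1, x2)
fit <- gbm(y ~ x1 + x2, data = d,
           distribution = "gaussian",
           n.trees = 100, interaction.depth = 2,
           shrinkage = 0.05)

# gbm returns predictions even for rows where x1 is NA:
p <- predict(fit, newdata = d, n.trees = 100)
sum(is.na(p))   # expect 0: no missing predictions
```

The point of the sketch is only that the model runs end to end with NAs present; it does not reveal the internal mechanism, which is exactly the question being asked.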
The gbm function can itself be used for imputation, as described in Jeffrey Wong's blog: missing values get surrogate splits, and the user can then obtain predictions for items with incomplete predictor sets.
He has developed a package based on this approach. The GitHub repo has this in the header of one of the files for gbm:
#' GBM Imputation
#'
#' Imputation using Boosted Trees
#' Fill each column by treating it as a regression problem. For each
#' column i, use boosted regression trees to predict i using all other
#' columns except i. If the predictor variables also contain missing data,
#' the gbm function will itself use surrogate variables as substitutes for the predictors.
#' This imputation function can handle both categorical and numeric data.
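The column-by-column scheme that the header describes can be sketched as follows. This is an assumption-laden illustration, not Jeff Wong's actual gbmImpute code, and for simplicity it handles only numeric columns (his package also handles categorical data):

```r
library(gbm)

# Hedged sketch of the scheme in the header above: for each numeric
# column i containing NAs, fit a gbm that predicts column i from all
# other columns, then fill the NAs with that model's predictions.
# gbm itself copes with NAs remaining in the predictor columns.
impute_gbm_sketch <- function(df, n.trees = 100) {
  for (target in names(df)) {
    miss <- is.na(df[[target]])
    if (!any(miss) || !is.numeric(df[[target]])) next
    fit <- gbm(reformulate(setdiff(names(df), target), target),
               data = df[!miss, ],
               distribution = "gaussian",
               n.trees = n.trees, interaction.depth = 2)
    df[[target]][miss] <- predict(fit, newdata = df[miss, , drop = FALSE],
                                  n.trees = n.trees)
  }
  df
}
```

Note that the real package makes further design choices this sketch ignores, such as choosing the distribution per column type and tuning the number of trees.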
To find this I merely typed this into a Google search: how does gbm deal with missing values. It was the 2nd hit for me.
https://github.com/jeffwong/imputation/blob/master/R/gbmImpute.R