R: How do boosted regression trees deal with missing data? [closed]

Tags: r, tree, regression

How does the R implementation of boosted regression trees (package gbm) deal with missing values of the predictor variables by default? Are they imputed, and if so, according to which algorithm?

Background of my question: I did the analysis almost a year ago, using the scripts provided by Elith et al. 2008 (A working guide to boosted regression trees, Journal of Animal Ecology 77, 802–813) to invoke gbm. I have now become aware that I had NAs for some of the predictor variables, and I wonder how the boosted regression trees dealt with them. Browsing through various manuals and papers I found statements like "boosted regression trees can accommodate missing values" and the like, but I couldn't find a precise description of what gbm does with missing values. The analysis itself ran without problems, so gbm must have dealt with them in one way or another. In the gbm manual there is even an example where NAs are deliberately introduced to demonstrate that gbm keeps working without problems. Now I'd like to know what gbm precisely does with NAs (skip them, impute them, ...?).
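For illustration, here is a minimal sketch (not from the Elith et al. scripts; the data and variable names are made up) of the kind of demonstration the gbm manual gives: NAs are deliberately introduced into a predictor, and the model still fits and predicts.

library(gbm)

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Deliberately introduce missing values in one predictor
dat$x1[sample(n, 20)] <- NA

# gbm fits without complaint despite the NAs
fit <- gbm(y ~ x1 + x2, data = dat,
           distribution = "gaussian",
           n.trees = 100,
           interaction.depth = 2,
           shrinkage = 0.1)

# Predictions also work for rows where x1 is missing
preds <- predict(fit, newdata = dat, n.trees = 100)
head(preds)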

asked Sep 06 '13 by user7417

People also ask

Can decision trees handle missing data?

Decision trees can automatically handle missing values. They are also usually robust to outliers and can handle them automatically.

How do Boosted regression trees work?

Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance); a toy sketch of this idea appears after this list.

When would you use a boosted regression tree?

Boosted regression trees are a powerful algorithm. They work very well with large datasets, or when you have a large number of environmental variables compared to the number of observations, and they are very robust to missing values and outliers.
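As promised above, here is a toy sketch of the boosting idea (illustrative only; this is not how gbm is implemented, and all names are made up): many depth-1 regression trees are fitted to the running residuals, and their shrunken predictions are summed.

library(rpart)

set.seed(42)
n <- 300
dat <- data.frame(x = runif(n, -3, 3))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

shrinkage <- 0.1
n_trees <- 100
pred <- rep(mean(dat$y), n)  # start from the mean

for (m in seq_len(n_trees)) {
  # residuals are the negative gradient for squared-error loss
  dat$resid <- dat$y - pred
  stump <- rpart(resid ~ x, data = dat,
                 control = rpart.control(maxdepth = 1))
  pred <- pred + shrinkage * predict(stump, dat)
}

# pred now approximates sin(x): a sum of many simple models

Each stump on its own is a weak model; the shrunken sum of all of them is the improved predictor described above.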


1 Answer

The gbm function can be used for imputation, as described in Jeffrey Wong's blog. Missing values get surrogate splits, and the user can then get predictions for items with incomplete predictor sets.

He has developed a package based on this approach. The GitHub repo has this in the header of one of the files for gbm:

#' GBM Imputation
#'
#' Imputation using Boosted Trees
#' Fill each column by treating it as a regression problem. For each
#' column i, use boosted regression trees to predict i using all other
#' columns except i. If the predictor variables also contain missing data,
#' the gbm function will itself use surrogate variables as substitutes for the predictors.
#' This imputation function can handle both categorical and numeric data.

To find this I merely typed the following into a Google search: how does gbm deal with missing values. It was the 2nd hit for me.

https://github.com/jeffwong/imputation/blob/master/R/gbmImpute.R
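For concreteness, here is an illustrative sketch of the column-by-column idea described in that header (this is not the package's actual code; the function name is made up, and it only handles a numeric target column, whereas the package also handles categorical data):

library(gbm)

impute_column_gbm <- function(df, target, n.trees = 100) {
  miss <- is.na(df[[target]])
  if (!any(miss)) return(df)
  # Treat the target column as a regression problem on all other columns
  form <- as.formula(paste(target, "~ ."))
  fit <- gbm(form, data = df[!miss, ],
             distribution = "gaussian",
             n.trees = n.trees,
             interaction.depth = 2,
             shrinkage = 0.1)
  # gbm itself tolerates NAs in the remaining predictor columns,
  # so rows with other missing predictors can still be predicted
  df[[target]][miss] <- predict(fit, newdata = df[miss, , drop = FALSE],
                                n.trees = n.trees)
  df
}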

answered Sep 23 '22 by IRTFM