
xgboost: handling of missing values for split candidate search

In section 3.4 of their article, the authors explain how they handle missing values when searching for the best candidate split during tree growing. Specifically, they assign a default direction to nodes whose splitting feature has missing values in the current instance set. At prediction time, if the prediction path goes through such a node and the feature value is missing, the default direction is followed.

However, the prediction phase would break down when the feature value is missing and the node has no default direction (and this can occur in many scenarios). In other words, how do they associate a default direction with every node, even those whose splitting feature has no missing values in the active instance set at training time?

pmarini asked Jun 03 '16


People also ask

Can XGBoost handle missing values?

How to deal with missing values. XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training. Note that the gblinear booster treats missing values as zeros.
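As a quick illustration (a sketch with made-up data, not from the question; it assumes the R xgboost package), training and predicting on a matrix containing NAs works out of the box, with no imputation step:

    ## Sketch: xgboost trains and predicts on data containing NAs;
    ## each split's learned default direction routes the missing values.
    library(xgboost)

    set.seed(1)
    x <- matrix(rnorm(500 * 4), ncol = 4)
    y <- as.numeric(x[, 1] + rnorm(500) > 0)
    x[sample(length(x), 100)] <- NA        ## punch random holes in the features

    bst <- xgboost(data = x, label = y, max.depth = 3, eta = 0.1,
                   nrounds = 20, objective = "binary:logistic")
    head(predict(bst, x))                  ## works, no imputation needed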

How does XGBoost handle sparse data?

XGBoost can take a sparse matrix as input. This allows you to convert categorical variables with high cardinality into a dummy matrix, then build a model without getting an out of memory error.
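For example (a sketch with an invented factor column; sparse.model.matrix comes from the Matrix package), a categorical variable can be one-hot encoded into a sparse matrix and passed to xgboost directly:

    ## Sketch: encode a categorical variable as a sparse dgCMatrix of dummies
    ## and feed it to xgboost, avoiding a dense dummy matrix in memory.
    library(Matrix)
    library(xgboost)

    df <- data.frame(city = factor(sample(letters, 1000, replace = TRUE)),
                     y    = rbinom(1000, 1, 0.5))
    X  <- sparse.model.matrix(y ~ city - 1, data = df)   ## sparse dummies

    bst <- xgboost(data = X, label = df$y, max.depth = 3,
                   nrounds = 10, objective = "binary:logistic")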

Can tree based models handle missing values?

So, if there is high non-linearity between the independent variables, decision trees may outperform curve-based algorithms. Decision trees can also handle missing values automatically, as sketched below.
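For instance, CART implementations such as R's rpart route observations with missing values through surrogate splits (a sketch on iris with artificially removed values):

    ## Sketch: rpart trains despite NAs and handles them via surrogate splits.
    library(rpart)

    ir <- iris
    ir$Petal.Length[sample(nrow(ir), 30)] <- NA   ## introduce missing values

    fit <- rpart(Species ~ ., data = ir)
    predict(fit, ir[1:5, ], type = "class")       ## NAs handled at prediction too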

Can random forest handle missing values?

Typically, random forest methods/packages encourage two ways of handling missing values: a) drop data points with missing values (not recommended); b) fill in missing values with the median (for numerical values) or mode (for categorical values).
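Option (b) can be done with randomForest::na.roughfix (a sketch on iris with artificially removed values):

    ## Sketch: median/mode imputation before fitting a random forest.
    library(randomForest)

    ir <- iris
    ir$Sepal.Width[sample(nrow(ir), 20)] <- NA

    ir_filled <- na.roughfix(ir)   ## numeric -> median, factor -> most frequent level
    rf <- randomForest(Species ~ ., data = ir_filled)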


1 Answer

xgboost always assigns a default direction for missing values at every split, even if none are present in training. The default is the yes direction of the split criterion; the direction is only learned from data when missing values actually occur in training.

From the author:

[screenshot of the author's explanation; image not reproduced]

This can be observed with the following code:

    require(xgboost)

    data(agaricus.train, package='xgboost')

    sum(is.na(agaricus.train$data))
    ##[1] 0  

    bst <- xgboost(data = agaricus.train$data,
                   label = agaricus.train$label,
                   max.depth = 4,
                   eta = .01,
                   nrounds = 100,
                   nthread = 2,
                   objective = "binary:logistic")

    dt <- xgb.model.dt.tree(model = bst)  ## records all the splits

> head(dt)
     ID Feature        Split  Yes   No Missing      Quality   Cover Tree Yes.Feature Yes.Cover  Yes.Quality
1:  0-0      28 -1.00136e-05  0-1  0-2     0-1 4000.5300000 1628.25    0          55    924.50 1158.2100000
2:  0-1      55 -1.00136e-05  0-3  0-4     0-3 1158.2100000  924.50    0           7    679.75   13.9060000
3: 0-10    Leaf           NA   NA   NA      NA   -0.0198104  104.50    0          NA        NA           NA
4: 0-11       7 -1.00136e-05 0-15 0-16    0-15   13.9060000  679.75    0        Leaf    763.00    0.0195026
5: 0-12      38 -1.00136e-05 0-17 0-18    0-17   28.7763000   10.75    0        Leaf    678.75   -0.0199117
6: 0-13    Leaf           NA   NA   NA      NA    0.0195026  763.00    0          NA        NA           NA
   No.Feature No.Cover No.Quality
1:       Leaf   104.50 -0.0198104
2:         38    10.75 28.7763000
3:         NA       NA         NA
4:       Leaf     9.50 -0.0180952
5:       Leaf     1.00  0.0100000
6:         NA       NA         NA

> all(dt$Missing == dt$Yes, na.rm = TRUE)
[1] TRUE
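As a further check (my own sketch, not part of the original answer, reusing the `bst` model and `agaricus.train` data from above): prediction succeeds even when every feature is missing, since each node simply follows its Missing (here: Yes) branch.

    ## Sketch: an all-NA row is routed down the default branch at every node.
    x_na <- matrix(NA_real_, nrow = 1, ncol = ncol(agaricus.train$data))
    predict(bst, x_na)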

Source code: https://github.com/tqchen/xgboost/blob/8130778742cbdfa406b62de85b0c4e80b9788821/src/tree/model.h#L542

T. Scharf answered Sep 17 '22