 

The xgboost package and the random forests regression

The xgboost package can build a random forest (in fact, it chooses a random subset of columns once for the whole tree rather than at each node, as in the classical version of the algorithm, but that can be tolerated). However, it seems that for regression only one tree from the forest (perhaps the last one built) is used.

To see this, consider a standard toy example.

library(xgboost)
library(randomForest)

data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data,
                     label = agaricus.train$label)
bst = xgb.train(data = dtrain,
                nrounds = 1,              # a single boosting round ...
                subsample = 0.8,
                colsample_bytree = 0.5,
                num_parallel_tree = 100,  # ... of 100 bagged trees
                verbose = 2,
                max_depth = 12)

answer1 = predict(bst, dtrain)
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)  # sum of squared errors

forest = randomForest(x = as.matrix(agaricus.train$data),
                      y = agaricus.train$label, ntree = 50)

answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)

Yes, of course, by default this xgboost random forest uses not a Gini-style score function but plain MSE; that can easily be changed. And, strictly speaking, this is not a proper validation, and so on. None of that affects the main problem: regardless of which sets of parameters are tried, the results are surprisingly bad compared with the randomForest implementation. This holds for other data sets as well.

Could anybody provide a hint about this strange behaviour? For classification tasks the algorithm does work as expected.

UPDATE:

Well, all trees are grown and all are used to make a prediction. You can check that using the 'ntreelimit' parameter of the 'predict' function.

The main problem remains: is the specific form of the Random Forest algorithm produced by the xgboost package valid?

Cross-validation, parameter tuning and the like have nothing to do with it -- anyone can add the necessary corrections to the code and see what happens.

You may supply a custom objective function like this:

mse = function(predict, dtrain)
{
  real = getinfo(dtrain, 'label')
  return(list(grad = 2 * (predict - real),
              hess = rep(2, length(real))))
}

This ensures that the MSE is used when choosing a variable for the split. Even after that, the results are surprisingly bad compared to those of randomForest.
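For completeness, here is how such a function can be wired in: in the R package a customized objective is passed to xgb.train through its 'obj' argument. This is only a sketch; it re-declares the mse function and reuses the agaricus data from above.

```r
library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# custom MSE objective: gradient and hessian of (predict - real)^2
mse = function(predict, dtrain)
{
  real = getinfo(dtrain, 'label')
  list(grad = 2 * (predict - real),
       hess = rep(2, length(real)))
}

bst = xgb.train(data = dtrain,
                nrounds = 1,
                obj = mse,                # plug in the custom objective
                params = list(subsample = 0.8,
                              colsample_bytree = 0.5,
                              num_parallel_tree = 100,
                              max_depth = 12))
```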

Maybe the problem is academic in nature and concerns the way the random subset of features for a split is chosen. The classical implementation chooses a subset of features (its size is specified with 'mtry' in the randomForest package) for EVERY split separately, while the xgboost implementation chooses one subset per tree (specified with 'colsample_bytree').

So this subtle difference appears to be of great importance, at least for some types of datasets. It is interesting, indeed.
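(A note for readers on newer versions: xgboost releases from roughly 0.81 onward add a 'colsample_bynode' parameter that resamples the column subset at every split, which is much closer to mtry. The sketch below assumes such a version is installed.)

```r
library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# random-forest-style run with per-split column sampling
bst = xgb.train(data = dtrain,
                nrounds = 1,                      # one round ...
                params = list(num_parallel_tree = 100,  # ... of 100 bagged trees
                              subsample = 0.8,
                              colsample_bynode = 0.5,   # sampled per split, like mtry
                              max_depth = 12,
                              eta = 1))           # no shrinkage, pure bagging
```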

asked Jan 19 '16 by mv_


1 Answer

xgboost (random forest style) does use more than one tree to predict, but there are many other differences to explore.

I myself am new to xgboost, but curious. So I wrote the code below to visualize the trees. You can run the code yourself to verify or explore other differences.

Your data set of choice is a classification problem, as the labels are either 0 or 1. I prefer to switch to a simple regression problem to visualize what xgboost does.

True model: y = x1 * x2 + noise

Whether you train a single tree or multiple trees with the code examples below, you can observe that the learned model structure does contain more than one tree. You cannot tell from prediction accuracy alone how many trees were trained.
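One direct way to check this (a sketch, reusing the agaricus model from the question): dump the learned model structure with xgb.model.dt.tree and count the distinct tree indices.

```r
library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
bst = xgb.train(data = dtrain, nrounds = 1,
                params = list(num_parallel_tree = 100, subsample = 0.8,
                              colsample_bytree = 0.5, max_depth = 12))

# one row per node; the 'Tree' column indexes the tree each node belongs to
trees = xgb.model.dt.tree(model = bst)
length(unique(trees$Tree))  # number of trees actually stored in the model
```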

Maybe the predictions are different because the implementations are different. None of the ~5 RF implementations I know of are exactly alike, and this xgboost (RF style) is at best a distant "cousin".

I observe that colsample_bytree is not equal to mtry, as the former uses the same subset of variables/columns for the entire tree. My regression problem is one big interaction only, which cannot be learned if trees use only either x1 or x2. Thus in this case colsample_bytree must be set to 1 so that all trees use both variables. Regular RF could model this problem with mtry = 1, as each node would use either x1 or x2.

I see your randomForest predictions are not out-of-bag cross-validated. If you want to draw any conclusions from predictions, you must cross-validate, especially for fully grown trees.
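A quick illustration of the OOB point (a sketch on simulated data): calling predict on a randomForest object without newdata returns out-of-bag predictions, which give an honest error estimate, while predicting on the training matrix itself is wildly optimistic for fully grown trees.

```r
library(randomForest)

set.seed(1)
X = data.frame(replicate(2, rnorm(2000)))
y = X$X1 * X$X2 + rnorm(2000) * .5
rf = randomForest(X, y, ntree = 100)

mean((predict(rf) - y)^2)     # OOB predictions: honest error estimate
mean((predict(rf, X) - y)^2)  # predictions on training data: optimistic
```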

NB: you need to patch the function vec.plot, as it does not support xgboost out of the box, because xgboost does not accept a data.frame as valid input. The instructions in the code should be clear.

library(xgboost)
library(rgl)
library(forestFloor)

Data = data.frame(replicate(2, rnorm(5000)))
Data$y = Data$X1 * Data$X2 + rnorm(5000) * .5
gradientByTarget = fcol(Data, 3)
plot3d(Data, col = gradientByTarget)  # true data structure

fix(vec.plot)  # change these two lines in the function, as xgboost does not accept a data.frame
#16# yhat.vec = predict(model, as.matrix(Xtest.vec))
#21# yhat.obs = predict(model, as.matrix(Xtest.obs))

# 1 single deep tree
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 250))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget, grid = 200)
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])), col = gradientByTarget)
# clearly just one tree

# 100 trees (gbm boosting)
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 100, params = list(max.depth = 16, eta = .5, subsample = .6))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget)
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])), col = gradientByTarget)  # predictions are not OOB cross-validated!


# 20 deep trees (bagging), columns sampled once per tree
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 250,
                    num_parallel_tree = 20, colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget)  # bagged mix of trees
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])))  # terrible fit!!
# problem: colsample_bytree is NOT mtry, as columns are only sampled once per tree
# (this could be raised as an issue on their GitHub page, that this does not mimic RF)


# 200 deep trees (bagging), no column limitation
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 500,
                    num_parallel_tree = 200, colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget)  # bagged mix of trees
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])))
# voila, the model can fit the data
answered Dec 15 '22 by Soren Havelund Welling