
Difference in memory usage between gbm and blackboost

Tags: r, gbm

I'm working on a dataset with around 250,000 observations and 50 predictors (some are factors, so in the end there are around 100 features), and I have trouble using the blackboost() function (from the mboost package), which gives me a memory allocation error.

At the same time, gbm() has no problem dealing with this amount of data. According to the documentation, the algorithm used by blackboost is the same as the one used by gbm (http://cran.r-project.org/web/packages/mboost/mboost.pdf).

It's not clear why one function is able to handle the data and the other is not. My guesses:

  • gbm has a subsampling strategy (set by the "bag.fraction" argument), which doesn't seem to be implemented in blackboost and which affects memory usage.
  • gbm uses the CART function to build the trees, while blackboost uses ctree, which seems to have a huge memory footprint (see "How to remove training data from party:::ctree models?").
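
To make the first guess concrete, here is a minimal base-R sketch of fitting on a random subset of rows. It is only an approximation of the idea: gbm draws a fresh subsample at every boosting iteration, whereas this draws one fixed subset up front. The toy data and the names dat, sub and bag_fraction are made up for illustration, and the commented-out blackboost() call is an untested assumption about how the subset would then be used.

```r
set.seed(1)

# Toy stand-in for the real data (hypothetical names and sizes)
n <- 10000
dat <- data.frame(
  y  = factor(rbinom(n, 1, 0.3)),
  x1 = rnorm(n),
  x2 = factor(sample(letters[1:4], n, replace = TRUE))
)

# Draw one random subset of rows, in the spirit of gbm's bag.fraction
bag_fraction <- 0.5
idx <- sample.int(nrow(dat), size = floor(bag_fraction * nrow(dat)))
sub <- dat[idx, , drop = FALSE]
rownames(sub) <- NULL  # drop the stored row labels of the subset

# Compare the memory needed for the subset with that of the full data
size_ratio <- as.numeric(object.size(sub)) / as.numeric(object.size(dat))
print(nrow(sub))
print(size_ratio)

# The model would then be fitted on `sub` instead of `dat`, e.g. (not run):
# fit <- mboost::blackboost(y ~ ., data = sub, family = mboost::AUC())
```

This trades information for memory, so the fraction would have to be weighed against the loss in accuracy.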

I want to use the AUC() loss function, which is available in mboost but not in gbm, so I would be interested in any suggestions for overcoming the memory limits of blackboost.

One additional question: when I try to decrease the number of variables in my model, I get this new error from blackboost:

Error in matrix(f[ind1], nrow = n0, ncol = n1, byrow = TRUE) : the length of the data [107324] is not a multiple of the number of lines [152107]

It seems to come from the AUC gradient function.

Thank you for your help.

Alex asked Apr 18 '14 08:04



1 Answer

You are correct that ctree is one of the causes. The script below illustrates this point. You can reduce the memory requirements somewhat by setting tree_controls = party::ctree_control(..., remove_weights = TRUE), as I show. However, as far as I am aware, you cannot avoid the additional stored data.frame and some other causes of memory usage.
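
For intuition about why object.size() on a fitted model can understate its footprint, here is a package-free sketch of the mechanism. It does not use mboost at all: make_model and big are hypothetical names, and the closure merely stands in for the way a fitted object can keep data alive in a function's environment.

```r
# Hypothetical model constructor: the returned list looks small, but its
# predict function is a closure that keeps `big` alive in its environment.
make_model <- function(n) {
  big <- rnorm(n)  # stands in for stored trees or a model frame
  list(predict = function() mean(big))
}

m <- make_model(1e6)

# object.size() does not include the environment of a function,
# so the size reported for the list is small
size_obj <- as.numeric(object.size(m))

# The data only shows up when the closure's environment is inspected directly
env <- environment(m$predict)
size_env <- as.numeric(object.size(env$big))

cat("list itself (bytes):    ", size_obj, "\n")
cat("captured vector (bytes):", size_env, "\n")
```

The same kind of inspection, environment() followed by object.size() on the objects found there, is what locates the hidden memory in a blackboost fit.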

Here is the full example:

# Load data and set options
options(digits = 4)
data("BostonHousing", package = "mlbench")

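# Size of the training data. The "bytes" label printed below is an artifact of
# the object_size class; after dividing by 10^6 the number is in MB
object.size(BostonHousing) / 10^6 # in MB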
#> 0.1 bytes

# blackboost and mboost store a ctree-like structure not on the object itself
# but in an environment in the background. These can be big!
# First, we use some of the default settings
ctrl_lrg_mem <- party::ctree_control(
  teststat = "max",
  testtype = "Teststatistic",
  mincriterion = 0,
  maxdepth = 3,
  stump = FALSE,
  minbucket = 20,
  savesplitstats = FALSE, # Default w/ mboost
  remove_weights = FALSE) # Default w/ mboost

gc() # shows memory usage before
#>           used  (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells 2467924 131.9    3886542 207.6  3886542 207.6
#> Vcells 4553719  34.8   14341338 109.5 22408297 171.0
fit1 <- mboost::blackboost(
  medv ~ ., data = BostonHousing,
  tree_controls = ctrl_lrg_mem,
  control = mboost::boost_control(
    mstop = 100))
gc() # shows memory usage after
#>           used  (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells 2494735 133.3    3886542 207.6  3886542 207.6
#> Vcells 5608368  42.8   14341338 109.5 22408297 171.0

# It is not the object itself that requires a lot of memory
object.size(fit1) / 10^6
#> 1.3 bytes

# It is the objects stored in the environments in the background
tmp_env <- environment(fit1$predict)
length(tmp_env$ens) # The boosted trees
#> [1] 100
sum(unlist(lapply(tmp_env$ens, object.size))) / 10^6
#> [1] 7.312

# Moreover, there is also a model frame for the data stored in the baselearner 
# function's environment which takes some space
env <- environment(fit1$basemodel[[1]]$fit)
str(env$df) # data frame of initial data
#> 'data.frame':    506 obs. of  14 variables:
#>  $ crim                     : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
#>  $ zn                       : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
#>  $ indus                    : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
#>  $ chas                     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ nox                      : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
#>  $ rm                       : num  6.58 6.42 7.18 7 7.15 ...
#>  $ age                      : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
#>  $ dis                      : num  4.09 4.97 4.97 6.06 6.06 ...
#>  $ rad                      : num  1 2 2 3 3 3 5 5 5 5 ...
#>  $ tax                      : num  296 242 242 222 222 222 311 311 311 311 ...
#>  $ ptratio                  : num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
#>  $ b                        : num  397 397 393 395 397 ...
#>  $ lstat                    : num  4.98 9.14 4.03 2.94 5.33 ...
#>  $ WLKJDJDQYBTDQCZDNHZMPZNCS: num  0 0 0 0 0 0 0 0 0 0 ...
object.size(env$df) / 10^6
#> 0.1 bytes
# str(env$object) # output excluded for space reasons
object.size(env$object) / 10^6
#> 0.8 bytes

# The above implies that if your data is 1 GB, then the fit will require 1 GB as
# well, as far as I gather

# We can, however, reduce the memory requirements
ctrl_sml_mem <- party::ctree_control(
  teststat = "max",
  testtype = "Teststatistic",
  mincriterion = 0,
  maxdepth = 3,
  stump = FALSE,
  minbucket = 20,
  savesplitstats = FALSE,
  remove_weights = TRUE)  # Changed

gc()
#>           used  (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells 2494810 133.3    3886542 207.6  3886542 207.6
#> Vcells 5608406  42.8   14341338 109.5 22408297 171.0
fit2 <- mboost::blackboost(
  medv ~ ., data = BostonHousing,
  tree_controls = ctrl_sml_mem,
  control = mboost::boost_control(
    mstop = 100))
gc()
#>           used  (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells 2520425 134.7    3886542 207.6  3886542 207.6
#> Vcells 6081411  46.4   14341338 109.5 22408297 171.0

# This reduces the size of the objects in the background
tmp_env <- environment(fit2$predict)
length(tmp_env$ens) # The boosted trees
#> [1] 100
sum(unlist(lapply(tmp_env$ens, object.size))) / 10^6
#> [1] 2.611

#####
# The version I run
sessionInfo(package = c("party", "mboost"))
#> R version 3.4.0 (2017-04-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
#> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
#> [5] LC_TIME=English_United Kingdom.1252    
#> 
#> attached base packages:
#> character(0)
#> 
#> other attached packages:
#> [1] party_1.2-3  mboost_2.8-0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.11        compiler_3.4.0      formatR_1.4         git2r_0.18.0        R.methodsS3_1.7.1  
#>  [6] methods_3.4.0       R.utils_2.5.0       utils_3.4.0         tools_3.4.0         grDevices_3.4.0    
#> [11] boot_1.3-19         digest_0.6.12       jsonlite_1.4        memoise_1.1.0       R.cache_0.12.0     
#> [16] lattice_0.20-35     Matrix_1.2-9        shiny_1.0.2         parallel_3.4.0      curl_2.5           
#> [21] mvtnorm_1.0-6       speedglm_0.3-2      coin_1.1-3          R.rsp_0.41.0        withr_1.0.2        
#> [26] httr_1.2.1          stringr_1.2.0       knitr_1.15.1        stabs_0.6-2         graphics_3.4.0     
#> [31] datasets_3.4.0      stats_3.4.0         devtools_1.12.0     stats4_3.4.0        dynamichazard_0.3.0
#> [36] grid_3.4.0          base_3.4.0          data.table_1.10.4   R6_2.2.0            survival_2.41-2    
#> [41] multcomp_1.4-6      TH.data_1.0-8       magrittr_1.5        nnls_1.4            codetools_0.2-15   
#> [46] modeltools_0.2-21   htmltools_0.3.6     splines_3.4.0       MASS_7.3-47         rsconnect_0.7      
#> [51] strucchange_1.5-1   mime_0.5            xtable_1.8-2        httpuv_1.3.3        quadprog_1.5-5     
#> [56] sandwich_2.3-4      stringi_1.1.5       zoo_1.8-0           R.oo_1.21.0
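
As a rough cross-check, the figures printed above can be put side by side. This sketch only re-uses numbers copied by hand from the gc() and object.size() output; the data frame mem and its column names are made up for illustration. The gc() differences cover everything allocated in the session between the two calls, so they are approximate.

```r
# Figures copied from the output above (all in MB)
mem <- data.frame(
  fit           = c("fit1 (remove_weights = FALSE)",
                    "fit2 (remove_weights = TRUE)"),
  vcells_before = c(34.8, 42.8),   # gc() Vcells "used" before the fit
  vcells_after  = c(42.8, 46.4),   # gc() Vcells "used" after the fit
  trees_mb      = c(7.312, 2.611)  # summed object.size of tmp_env$ens
)

# Growth in vector memory attributable to each fit
mem$vcells_growth <- mem$vcells_after - mem$vcells_before

# Relative saving in the stored trees from remove_weights = TRUE
tree_saving <- 1 - mem$trees_mb[2] / mem$trees_mb[1]

print(mem[, c("fit", "vcells_growth", "trees_mb")])
cat(sprintf("Tree storage shrinks by about %.0f%%\n", 100 * tree_saving))
```

Vector memory grows by about 8.0 MB for the first fit and about 3.6 MB for the second, and the stored trees shrink by roughly 64%, so the two measurements point in the same direction.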
Benjamin Christoffersen answered Sep 28 '22 07:09
