I'm working on a data set with around 250,000 observations and 50 predictors (some are factors, so around 100 features in the end), and I have trouble using the blackboost() function (from the mboost package), which gives me a memory allocation error.
At the same time, gbm() has no problem dealing with this amount of data. According to the documentation (http://cran.r-project.org/web/packages/mboost/mboost.pdf), the algorithm used by blackboost is the same as the one used by gbm.
It's not clear why one function is able to handle the data set and the other is not; my guess is that it has to do with the ctree structure that blackboost uses.
I want to use the AUC() loss function, which is available in mboost but not in gbm, so I would be interested in any suggestion to overcome blackboost's memory usage limits.
An additional question: when I try to decrease the number of variables in my model, I get this new error from blackboost:
Error in matrix(f[ind1], nrow = n0, ncol = n1, byrow = TRUE) : data length [107324] is not a multiple of the number of rows [152107]
It seems to come from the AUC gradient function.
Thank you for your help.
You are correct that the ctree structure is one of the causes. I show a script below which illustrates this point. You can reduce the memory requirements somewhat by setting tree_controls = party::ctree_control(..., remove_weights = TRUE), as I show. However, you cannot avoid the additionally stored data.frame and some other causes of memory usage, as far as I am aware.
Here is the example:
# Load data and set options
options(digits = 4)
data("BostonHousing", package = "mlbench")
# Size of the training data
object.size(BostonHousing) / 10^6 # in MB
#> 0.1 bytes
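# (note: object.size() keeps printing its "bytes" label after the division;
# the value above is in MB)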
# blackboost and mboost store a ctree-like structure, not on the object itself
# but in an environment in the background. These can be big!
# First, we use some of the default settings
ctrl_lrg_mem <- party::ctree_control(
teststat = "max",
testtype = "Teststatistic",
mincriterion = 0,
maxdepth = 3,
stump = FALSE,
minbucket = 20,
savesplitstats = FALSE, # Default w/ mboost
remove_weights = FALSE) # Default w/ mboost
gc() # shows memory usage before
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 2467924 131.9 3886542 207.6 3886542 207.6
#> Vcells 4553719 34.8 14341338 109.5 22408297 171.0
fit1 <- mboost::blackboost(
medv ~ ., data = BostonHousing,
tree_controls = ctrl_lrg_mem,
control = mboost::boost_control(
mstop = 100))
gc() # shows memory usage after
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 2494735 133.3 3886542 207.6 3886542 207.6
#> Vcells 5608368 42.8 14341338 109.5 22408297 171.0
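# Vcells used grew by roughly 8 MB (34.8 MB -> 42.8 MB); the objects that
# account for most of this are inspected below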
# It is not the object itself that requires a lot of memory
object.size(fit1) / 10^6
#> 1.3 bytes
# It is the objects stored in the environments in the background
tmp_env <- environment(fit1$predict)
length(tmp_env$ens) # The boosted trees
#> [1] 100
sum(unlist(lapply(tmp_env$ens, object.size))) / 10^6
#> [1] 7.312
# Moreover, there is also a model frame for the data stored in the baselearner
# function's environment which takes some space
env <- environment(fit1$basemodel[[1]]$fit)
str(env$df) # data frame of initial data
#> 'data.frame': 506 obs. of 14 variables:
#> $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
#> $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
#> $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
#> $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
#> $ rm : num 6.58 6.42 7.18 7 7.15 ...
#> $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
#> $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
#> $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
#> $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
#> $ ptratio : num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
#> $ b : num 397 397 393 395 397 ...
#> $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
#> $ WLKJDJDQYBTDQCZDNHZMPZNCS: num 0 0 0 0 0 0 0 0 0 0 ...
object.size(env$df) / 10^6
#> 0.1 bytes
# str(env$object) # output excluded for space reasons
object.size(env$object) / 10^6
#> 0.8 bytes
# The above implies that if your data is 1 GB then the fit will require 1 GB as
# well, as far as I gather
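# A rough back-of-the-envelope figure for the data set in the question
# (assuming roughly 250,000 rows and 100 numeric columns after expanding
# the factors): the stored copy alone would be on the order of
250000 * 100 * 8 / 10^6 # 8 bytes per double, in MB
#> [1] 200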
# We can, though, reduce the memory requirements
ctrl_sml_mem <- party::ctree_control(
teststat = "max",
testtype = "Teststatistic",
mincriterion = 0,
maxdepth = 3,
stump = FALSE,
minbucket = 20,
savesplitstats = FALSE,
remove_weights = TRUE) # Changed
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 2494810 133.3 3886542 207.6 3886542 207.6
#> Vcells 5608406 42.8 14341338 109.5 22408297 171.0
fit2 <- mboost::blackboost(
medv ~ ., data = BostonHousing,
tree_controls = ctrl_sml_mem,
control = mboost::boost_control(
mstop = 100))
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 2520425 134.7 3886542 207.6 3886542 207.6
#> Vcells 6081411 46.4 14341338 109.5 22408297 171.0
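# Vcells used grew by only roughly 3.6 MB this time (42.8 MB -> 46.4 MB)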
# This reduces the size of the objects in the background
tmp_env <- environment(fit2$predict)
length(tmp_env$ens) # The boosted trees
#> [1] 100
sum(unlist(lapply(tmp_env$ens, object.size))) / 10^6
#> [1] 2.611
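# A sanity check one may want to run (my assumption is that remove_weights
# only drops the node weights and does not affect prediction):
# all.equal(predict(fit1), predict(fit2))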
#####
# The versions I run
sessionInfo(package = c("party", "mboost"))
#> R version 3.4.0 (2017-04-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> character(0)
#>
#> other attached packages:
#> [1] party_1.2-3 mboost_2.8-0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.11 compiler_3.4.0 formatR_1.4 git2r_0.18.0 R.methodsS3_1.7.1
#> [6] methods_3.4.0 R.utils_2.5.0 utils_3.4.0 tools_3.4.0 grDevices_3.4.0
#> [11] boot_1.3-19 digest_0.6.12 jsonlite_1.4 memoise_1.1.0 R.cache_0.12.0
#> [16] lattice_0.20-35 Matrix_1.2-9 shiny_1.0.2 parallel_3.4.0 curl_2.5
#> [21] mvtnorm_1.0-6 speedglm_0.3-2 coin_1.1-3 R.rsp_0.41.0 withr_1.0.2
#> [26] httr_1.2.1 stringr_1.2.0 knitr_1.15.1 stabs_0.6-2 graphics_3.4.0
#> [31] datasets_3.4.0 stats_3.4.0 devtools_1.12.0 stats4_3.4.0 dynamichazard_0.3.0
#> [36] grid_3.4.0 base_3.4.0 data.table_1.10.4 R6_2.2.0 survival_2.41-2
#> [41] multcomp_1.4-6 TH.data_1.0-8 magrittr_1.5 nnls_1.4 codetools_0.2-15
#> [46] modeltools_0.2-21 htmltools_0.3.6 splines_3.4.0 MASS_7.3-47 rsconnect_0.7
#> [51] strucchange_1.5-1 mime_0.5 xtable_1.8-2 httpuv_1.3.3 quadprog_1.5-5
#> [56] sandwich_2.3-4 stringi_1.1.5 zoo_1.8-0 R.oo_1.21.0