I am attempting a randomForest analysis in R on a wide genetic dataset (662 x 35350). All variables except the outcome are numeric, and 99% of them are binary 0/1. I am quite familiar with randomForest(), but have previously only worked with datasets of 5,000-10,000 variables. The next planned phase of analyses will be on an exceptionally large dataset with millions of variables, so I am motivated to find a solution to this problem.
My understanding is that randomForest has no inherent limit on the number of variables, and I have read published work using variable counts in the 100,000s. When I attempt the analysis on the full dataset (setting ntree=100), I get: "Error: protect(): protection stack overflow"
This happens whether the dataset is a data frame (as it was originally provided) or converted to a matrix. When I submit the run to a cluster for parallel processing, I see that all of my cores are working as soon as I execute the code. I also see that RAM usage never approaches the machine's limit (48 GB); at most it hits about 16% during the execution attempt. (The same problem occurred on my 512 GB RAM machine at the office, where usage never exceeded about 5%.)
I have tried several solutions found online, including one from a previous Stack Overflow post (Increasing (or decreasing) the memory available to R processes). I followed the instructions provided by BobbyShaftoe in 2009 (adding --max-mem-size=49000M and --max-vsize=49000M in the Target field on the Shortcut tab of the shortcut's properties), but this prevented R from opening properly. I also tried running these flags at the command line, but they generated: "'--max-ppsize' / '--max-vsize=5000M' is not recognized as an internal or external command, operable program or batch file."
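(For reference, a sketch of how such startup flags would normally be passed from cmd.exe: they must follow the path to the R executable rather than be typed on their own, which is consistent with the "not recognized as an internal or external command" message. The install path below is only an assumed typical location; --max-ppsize is the flag that governs the pointer protection stack, and R accepts values up to 500000 for it.)

REM Assumed install path; adjust to the local R installation
"C:\Program Files\R\R-3.0.3\bin\x64\Rgui.exe" --max-ppsize=500000 --max-mem-size=49000M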
I have also read the suggestions in this post: How to improve randomForest performance?. However, I can't reduce the number of features until I have at least one run with the full feature set. (Plus, I am not sure the problem is RAM, per se.)
I'm on Windows 7 running Revolution R 7.2 (64-bit). My memory limit is set at 49807 MB, but I'm not sure whether memory.limit specifically addresses the allowed protection stack size.
Breaking the dataset into smaller chunks of variables does avoid the stack overflow, but it does not solve the analytic problem. Are there any R settings that might permit the analysis on the full dataset?
##########################################
# input DF
##########################################
object.size(inputDF) # 191083664 bytes (as matrix, size=189391088 bytes, not much smaller)
dim(inputDF) # 662 x 35350
##########################################
#Load necessary packages into R's memory
##########################################
require(iterators)
require(foreach)
require(parallel)
require(doParallel)
require(randomForest)
###########################################
# Get the number of available logical cores
###########################################
cores <- detectCores()
cores #12
###########################################
# Print info on computer, OS, cores
###########################################
print(paste('Processor: ', Sys.getenv('PROCESSOR_IDENTIFIER'), sep=''))
print(paste('OS: ', Sys.getenv('OS'), sep=''))
print(paste('Cores: ', cores, sep=''))
###########################################################################
# Setup clusters via parallel/DoParallel
###########################################################################
cl.spec <- rep("localhost", 10)
cl <- makeCluster(cl.spec, type="SOCK")
registerDoParallel(cl, cores=10)
###########################################################################
# RUN RANDOM FOREST
###########################################################################
system.time(forestOUT <- randomForest(as.factor(Dx01) ~ .,
data=inputDF,
do.trace = 10,
ntree=100,
mtry = sqrt(ncol(inputDF)),
nodesize = 0.1*nrow(inputDF),
importance=T,
proximity=F,
replace=TRUE,
keep.forest=TRUE))
stopCluster(cl)
See sessionInfo()
#Language: R
#OS: Windows 7
sessionInfo()
#R version 3.0.3 (2014-03-06)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
#[5] LC_TIME=English_Canada.1252
#
#attached base packages:
#[1] stats4 parallel splines grid stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] QuantPsyc_1.5 boot_1.3-13 perturb_2.05 RCurl_1.95-4.5 bitops_1.0-6 car_2.0-22
#[7] reprtree_0.6 plotrix_3.5-10 rpart.plot_1.4-5 sqldf_0.4-7.1 RSQLite.extfuns_0.0.1 RSQLite_1.0.0
#[13] gsubfn_0.6-6 proto_0.3-10 XML_3.98-1.1 RMySQL_0.9-3 DBI_0.3.1 mlbench_2.1-1
#[19] polycor_0.7-8 sfsmisc_1.0-26 quantregForest_0.2-3 tree_1.0-35 maptree_1.4-7 cluster_1.15.3
#[25] mice_2.22 VIM_4.0.0 colorspace_1.2-4 randomForest_4.6-10 ROCR_1.0-5 gplots_2.15.0
#[31] caret_6.0-37 partykit_0.8-0 biomaRt_2.18.0 NCBI2R_1.4.6 snpStats_1.12.0 betareg_3.0-5
#[37] arm_1.7-07 lme4_1.1-7 Rcpp_0.11.3 Matrix_1.1-4 nlme_3.1-118 mvtnorm_1.0-1
#[43] taRifx_1.0.6 sos_1.3-8 brew_1.0-6 R.utils_1.34.0 R.oo_1.18.0 R.methodsS3_1.6.1
#[49] rattle_3.3.0 jsonlite_0.9.13 httpuv_1.3.2 httr_0.5 gmodels_2.15.4.1 ggplot2_1.0.0
#[55] JGR_1.7-16 iplots_1.1-7 JavaGD_0.6-1 party_1.0-18 modeltools_0.2-21 strucchange_1.5-0
#[61] sandwich_2.3-2 zoo_1.7-11 pROC_1.7.3 e1071_1.6-4 psych_1.4.8.11 gtools_3.4.1
#[67] functional_0.6 modeest_2.1 stringi_0.3-1 languageR_1.4.1 utility_1.3 data.table_1.9.4
#[73] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-6 snow_0.3-13 doParallel_1.0.8 iterators_1.0.7
#[79] foreach_1.4.2 reshape2_1.4 reshape_0.8.5 plyr_1.8.1 xtable_1.7-4 stringr_0.6.2
#[85] foreign_0.8-61 Hmisc_3.14-6 Formula_1.1-2 survival_2.37-7 class_7.3-11 MASS_7.3-35
#[91] nnet_7.3-8 Revobase_7.2.0 RevoMods_7.2.0 RevoScaleR_7.2.0 lattice_0.20-27 rpart_4.1-5
#
#loaded via a namespace (and not attached):
#[1] abind_1.4-0 acepack_1.3-3.3 BiocGenerics_0.8.0 BradleyTerry2_1.0-5 brglm_0.5-9 caTools_1.17.1 chron_2.3-45
#[8] coda_0.16-1 codetools_0.2-9 coin_1.0-24 DEoptimR_1.0-2 digest_0.6.4 flexmix_2.3-12 gdata_2.13.3
#[15] glmnet_1.9-8 gtable_0.1.2 KernSmooth_2.23-13 latticeExtra_0.6-26 lmtest_0.9-33 minqa_1.2.4 munsell_0.4.2
#[22] nloptr_1.0.4 pkgXMLBuilder_1.0 png_0.1-7 RColorBrewer_1.0-5 revoIpe_1.0 robustbase_0.92-2 scales_0.2.4
#[29] sp_1.0-16 tcltk_3.0.3 tools_3.0.3 vcd_1.3-2
To revive an old question: I had the same problem, and the following solution worked for me (167 observations, 24,000+ RNA-seq features including numeric gene expression data and categorical metadata). I was able to run the code both on a compute cluster and locally on my 16 GB Surface Pro 4.
Imagine big_df is a data frame composed of predictor variables (e.g. var1, var2) and the response variable respvar. I think, as suggested in this post, the culprit is the formula-based interface. When you provide the predictor variables and the response variable separately to the function, it works. The same solution also worked for me when I was imputing missing values prior to the random forest analysis (the rfImpute() function).
# This fails
rf <- randomForest(respvar ~ ., data=big_df)
# This works
rf <- randomForest(x = big_df[, colnames(big_df) != "respvar"],
                   y = big_df$respvar)
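Applied to the data in the question, the equivalent non-formula call would look roughly like the sketch below (it assumes the outcome Dx01 is a column of inputDF, as implied by the original formula call):

# Sketch only: drop the outcome column and pass predictors and response separately
predictors <- inputDF[, colnames(inputDF) != "Dx01"]
forestOUT <- randomForest(x = predictors,
                          y = as.factor(inputDF$Dx01),
                          ntree = 100,
                          do.trace = 10,
                          importance = TRUE,
                          keep.forest = TRUE)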