 

How to improve randomForest performance?

I have a training set of size 38 MB (12 attributes with 420000 rows). I am running the R snippet below to train a model using randomForest. It is taking hours.

rf.model <- randomForest(
              Weekly_Sales~.,
              data=newdata,
              keep.forest=TRUE,
              importance=TRUE,
              ntree=200,
              do.trace=TRUE,
              na.action=na.roughfix
            )

I suspect na.roughfix is what is taking so long to execute, since there are so many NAs in the training set.

Could someone let me know how I can improve the performance?

My system configuration is:

Intel(R) Core i7 CPU @ 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS
asked Apr 15 '14 by user3497321



2 Answers

(The tl;dr is you should a) increase nodesize to >> 1 and b) exclude very low-importance feature columns, maybe even excluding (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect it is, run na.roughfix separately, as a standalone step, before calling randomForest. Get that red herring out of the way first.)
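Separating the imputation from the training might look like the following sketch (`newdata` and the formula come from the question; `newdata.fixed` is just an illustrative name):

```r
library(randomForest)

# Step 1: impute once, as a standalone step (roughly: column medians for
# numeric columns, most frequent level for factors)
newdata.fixed <- na.roughfix(newdata)

# Step 2: train on the already-imputed data -- no na.action needed, so any
# remaining slowness is clearly not caused by the imputation
rf.model <- randomForest(Weekly_Sales ~ ., data = newdata.fixed,
                         keep.forest = TRUE, ntree = 200)
```

If step 1 is fast and step 2 is still slow, na.roughfix was never the bottleneck.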

Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage and make sure you're not exceeding it. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking that it didn't increase disproportionately.)

The main parameters affecting the performance of randomForest are:

  • mtry (smaller is faster)
  • ntree (fewer trees is faster; note the argument is ntree, not ntrees)
  • number of features/columns in the data - more is quadratically slower, or worse! See below
  • number of observations/rows in the data
  • number of cores (more is faster - as long as a parallel option is being used)
  • some performance boost from setting importance=FALSE and proximity=FALSE (don't compute the proximity matrix)
  • Never use the insane default nodesize=1, for classification! In Breiman's package you can't directly set maxdepth, but use nodesize as a proxy for it, and also read all the good advice at: CrossValidated: "Practical questions on tuning Random Forests"
  • So here your data has 4.2e+5 rows; if each node shouldn't be smaller than ~0.01%, try nodesize=42. (First try nodesize=420 (0.1%), see how fast it is, then rerun, adjusting nodesize down. Empirically determine a good nodesize for this dataset.)
  • runtime is proportional to ~ 2^D_max, i.e. polynomial to (-log1p(nodesize))
  • optionally you can also speed up by using sampling, see the strata, sampsize arguments

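Putting those parameters together, a tuned call might look like the sketch below (the specific numbers are starting points to measure and adjust, not recommendations; the parallel part assumes the foreach/doParallel packages are installed):

```r
library(randomForest)

rf.fast <- randomForest(
  Weekly_Sales ~ .,
  data       = newdata,
  ntree      = 200,
  nodesize   = 420,        # ~0.1% of 420k rows; empirically tune downward
  importance = FALSE,      # skip the permutation-importance pass
  proximity  = FALSE,      # skip the N x N proximity matrix
  sampsize   = 42000,      # grow each tree on a 10% sample of the rows
  na.action  = na.roughfix
)

# To use multiple cores, grow sub-forests in parallel and merge them with
# randomForest::combine (randomForest itself is single-threaded):
library(doParallel)
registerDoParallel(cores = 4)
rf.par <- foreach(n = rep(50, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
  randomForest(Weekly_Sales ~ ., data = newdata, ntree = n,
               nodesize = 420, na.action = na.roughfix)
```

Here 4 workers each grow 50 trees, giving 200 total; verify the wall-clock time actually drops before scaling up.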
Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:

Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C

(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4, and verify you actually get a linear speedup, and not a slowdown.) (The effect of large R is worse than linear, maybe quadratic, since tree-partitioning has to consider all partitions of the data rows; it's certainly somewhat worse than linear. Check that by using sampling or indexing to give it only, say, 10% of rows.)

Tip: keeping lots of low-importance junk features quadratically increases runtime, for a sublinear increase in accuracy. This is because at each node we must consider all possible feature selections (or however many mtry allows), and within each tree we must consider all (F-choose-mtry) possible combinations of features. So here's my methodology for "fast-and-dirty feature-selection for performance":

  1. generate a forest normally (slow), but use a sane nodesize=42 or larger
  2. look at rf$importance or randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save the entire list for future reference.
  3. now rerun the forest, but only give it newdata[, importantCols]
  4. confirm that speed is quadratically faster and oob.error is not much worse
  5. once you know your variable importances, you can turn importance off (importance=FALSE)
  6. tweak mtry and nodesize (one at a time), rerun, and measure the speed improvement
  7. plot your performance results on logarithmic axes
  8. post us the results! Did you corroborate the above? Any comments on memory usage?
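A sketch of steps 1-5 (K=3 and the variable names are illustrative; importance must be TRUE in the first fit for permutation importance to be available):

```r
library(randomForest)

# Steps 1-2: one slow run with importance, then pick the top-K columns
rf.full <- randomForest(Weekly_Sales ~ ., data = newdata,
                        ntree = 200, nodesize = 42,
                        importance = TRUE, na.action = na.roughfix)
imp <- importance(rf.full, type = 1)        # permutation importance (%IncMSE)
importantCols <- names(sort(imp[, 1], decreasing = TRUE))[1:3]   # K = 3

# Step 3: rerun on only those columns (plus the response)
rf.small <- randomForest(Weekly_Sales ~ .,
                         data = newdata[, c(importantCols, "Weekly_Sales")],
                         ntree = 200, nodesize = 42,
                         importance = FALSE, na.action = na.roughfix)

# Step 4: compare runtimes, and e.g. tail(rf.full$mse, 1) vs
# tail(rf.small$mse, 1) to check the OOB error didn't get much worse
```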

(Note that the above is not a statistically valid procedure for actual feature-selection; do not rely on it for that. Read the randomForest package docs for the proper methods of RF-based feature-selection.)

answered Oct 23 '22 by smci


I suspect do.trace might also consume time... instead of do.trace = TRUE, you can use do.trace = 5 (which prints progress every 5 trees rather than after every tree) just to get some feel for the error. For a large dataset, tracing every tree takes up a lot of time as well.
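For example, applied to the question's call (a sketch; all other arguments as in the question):

```r
library(randomForest)

rf.model <- randomForest(Weekly_Sales ~ ., data = newdata,
                         keep.forest = TRUE, importance = TRUE,
                         ntree = 200,
                         do.trace = 5,          # print OOB error every 5 trees
                         na.action = na.roughfix)
```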

answered Oct 23 '22 by Manoj Kumar