I have a training set of size 38 MB (12 attributes with 420,000 rows). I am running the R snippet below to train the model using randomForest. It is taking hours to run.
rf.model <- randomForest(
    Weekly_Sales ~ .,
    data = newdata,
    keep.forest = TRUE,
    importance = TRUE,
    ntree = 200,
    do.trace = TRUE,
    na.action = na.roughfix
)
I think it is taking a long time to execute due to na.roughfix, since there are so many NAs in the training set. Could someone let me know how I can improve the performance?
My system configuration is:
Intel(R) Core i7 CPU @ 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS
By increasing nodesize and the number of trees you are playing two factors against each other. If you want to increase predictive power, go for a smaller nodesize and more trees: each tree will be larger and grown to the end, and as the number of trees increases the ensemble of them will perform better.
The main limitation of random forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to create predictions once they are trained.
Accuracy: 87.87%. An accuracy of 87.87% is not a great score, and there is a lot of scope for improvement. Let's plot the difference between the actual and the predicted values.
Instead of developing a more complex model to improve our random forest, we took the sensible step of collecting more data points and additional features. This approach was validated: we were able to decrease the error by 16.7% compared to the model trained on the limited data.
How does Random Forest work? A Random Forest selects a random subset of the training data set and builds a decision tree on each sub-dataset. It then aggregates the votes of the individual decision trees to determine the class of the test object.
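As a toy illustration of that idea (my own sketch using rpart and the built-in iris data, not part of the original post; a real Random Forest additionally samples features at each split):

library(rpart)

set.seed(1)
trees <- lapply(1:25, function(i) {
    boot <- iris[sample(nrow(iris), replace = TRUE), ]   # random subset of the training set
    rpart(Species ~ ., data = boot)                      # one decision tree per subset
})

# aggregate the trees' predictions for a single test object: the majority class wins
votes <- sapply(trees, function(tr) as.character(predict(tr, iris[1, ], type = "class")))
table(votes)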
But did you know you can also improve accuracy by tuning the parameters of the Random Forest? Rather than depending entirely on adding new data, you can tune the hyperparameters to improve accuracy. In this "how to" tutorial, you will learn how to improve the accuracy of a random forest classifier.
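One concrete way to do that in R (a sketch of my own, not from the original post, assuming the newdata / Weekly_Sales objects from the question) is randomForest::tuneRF, which searches over mtry; nodesize and ntree can then be adjusted by hand:

library(randomForest)

newdata.fixed <- na.roughfix(newdata)                        # tuneRF cannot handle NAs
x <- newdata.fixed[, setdiff(names(newdata.fixed), "Weekly_Sales")]
y <- newdata.fixed$Weekly_Sales

tuned <- tuneRF(x, y, ntreeTry = 50, stepFactor = 2,
                improve = 0.05, doBest = TRUE)               # refits with the best mtry found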
Random Forest is very useful if your dataset has many outliers, missing values, or skewed data. In the background, a Random Forest builds hundreds of trees; because of this it takes more time to predict, so you should not use it for real-time predictions.
(The tl;dr is you should a) increase nodesize to >> 1 and b) exclude very low-importance feature columns, maybe even exclude (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect that, run na.roughfix separately, as a standalone step, before calling randomForest. Get that red herring out of the way first.)
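For example, a minimal sketch of that standalone step (assuming the newdata object from the question):

library(randomForest)

newdata.fixed <- na.roughfix(newdata)    # impute NAs once: column medians / most frequent levels

rf.model <- randomForest(
    Weekly_Sales ~ .,
    data = newdata.fixed,                # no na.action needed any more
    keep.forest = TRUE,
    importance = TRUE,
    ntree = 200,
    do.trace = TRUE
)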
Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage and make sure you're not exceeding it. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking that it didn't increase disproportionately.)
The main parameters affecting the performance of randomForest are:

- nodesize: don't use the default nodesize=1 for classification! In Breiman's package you can't directly set maxdepth, but use nodesize as a proxy for that, and also read all the good advice at: CrossValidated: "Practical questions on tuning Random Forests". For this dataset, try nodesize=42. (First try nodesize=420 (0.1%), see how fast it is, then rerun, adjusting nodesize down. Empirically determine a good nodesize for this dataset.)
- the strata and sampsize arguments

Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:
Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C
(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4, and verify you actually do get a linear speedup, and not a slowdown.) (The effect of large R is worse than linear, maybe quadratic, since tree partitioning has to consider all partitions of the data rows; it's certainly somewhat worse than linear. Check that by using sampling or indexing to give it only, say, 10% of the rows.)
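A rough sketch of that "start small, scale up" timing check (the 10% sample, the nodesize values, and ntree=50 are illustrative choices of mine, assuming the newdata / Weekly_Sales objects from the question):

library(randomForest)

idx   <- sample(nrow(newdata), size = round(0.10 * nrow(newdata)))   # ~10% of the rows
small <- na.roughfix(newdata[idx, ])                                  # impute NAs once, up front

for (ns in c(420, 100, 42)) {                                         # decreasing nodesize values
    t <- system.time(
        randomForest(Weekly_Sales ~ ., data = small, ntree = 50, nodesize = ns)
    )
    cat("nodesize =", ns, " elapsed:", t["elapsed"], "sec\n")
}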
Tip: keeping lots of crap low-importance features quadratically increases runtime, for a sublinear increase in accuracy. This is because at each node we must consider all possible feature selections (or whatever number mtry allows), and within each tree we must consider all (F-choose-mtry) possible combinations of features. So here's my methodology, doing "fast-and-dirty feature-selection for performance":

1. Train a forest normally (slow), but with a sane nodesize=42 or larger.
2. Look at randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save that entire list for future reference.
3. Rerun the forest, but only give it newdata[,importantCols].
4. On subsequent runs, set importance=F, since you no longer need the importances.
(Note that the above is not a statistically valid procedure for actual feature selection; do not rely on it for that. Read the randomForest package documentation for the proper methods of RF-based feature selection.)
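Putting those steps together, a rough sketch of the workflow (the names rf.full, rf.fast, imp, K, and importantCols are my own, and the "%IncMSE" column assumes a regression forest, as in the question):

library(randomForest)

newdata.fixed <- na.roughfix(newdata)

# 1. one (slower) run with importances, using a sane nodesize
rf.full <- randomForest(Weekly_Sales ~ ., data = newdata.fixed,
                        ntree = 50, nodesize = 42, importance = TRUE)

# 2. inspect the importances and keep only the top-K predictors (K = 3 for a silly-fast example)
varImpPlot(rf.full)
imp           <- importance(rf.full)[, "%IncMSE"]
K             <- 3
importantCols <- names(sort(imp, decreasing = TRUE))[1:K]

# 3. rerun on just those columns, with importance turned off
rf.fast <- randomForest(Weekly_Sales ~ .,
                        data = newdata.fixed[, c("Weekly_Sales", importantCols)],
                        ntree = 200, nodesize = 42, importance = FALSE)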
I suspect do.trace might also consume time. Instead of do.trace = TRUE, you can use do.trace = 5 (print the running output only every 5 trees) just to get some feel for the errors. For a large dataset, do.trace = TRUE takes up a lot of time as well.
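For example, an illustrative tweak of the question's call:

library(randomForest)

rf.model <- randomForest(Weekly_Sales ~ ., data = newdata,
                         keep.forest = TRUE, importance = TRUE,
                         ntree = 200,
                         do.trace = 5,              # print the running output every 5 trees
                         na.action = na.roughfix)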