Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random forest on a big dataset

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get a Error in matrix(0, n, n) : too many elements specified error when trying to do the whole thing at once and cannot allocate enough memory kind of errors when running in on a subset of the data -- down to 10,000 or so observations.

Seeing that there is no chance I can add more RAM on my machine and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.

Any suggestions or workaround ideas are much appreciated.

like image 244
ktdrv Avatar asked Apr 05 '12 23:04

ktdrv


People also ask

How big should a random forest be?

They suggest that a random forest should have a number of trees between 64 - 128 trees. With that, you should have a good balance between ROC AUC and processing time.

What kind of data is random forest good for?

Random forests is great with high dimensional data since we are working with subsets of data. It is faster to train than decision trees because we are working only on a subset of features in this model, so we can easily work with hundreds of features.

Is random forest suitable for small dataset?

Conclusion: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.

Can random forest be used in deep learning?

Both the Random Forest and Neural Networks are different techniques that learn differently but can be used in similar domains. Random Forest is a technique of Machine Learning while Neural Networks are exclusive to Deep Learning.


Video Answer


1 Answers

You're likely asking randomForest to create the proximity matrix for the data, which if you think about it, will be insanely big: 1 million x 1 million. A matrix this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where n,n) is found is in calculating the proximity matrix.

But it's hard to help more, given that you've provided no details about the actual code you're using.

like image 162
joran Avatar answered Oct 16 '22 16:10

joran