Random forest on a big dataset


I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when I try to do the whole thing at once, and "cannot allocate enough memory"-type errors when I run it on a subset of the data, even down to 10,000 or so observations.

Seeing that there is no chance I can add more RAM on my machine and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.

Any suggestions or workaround ideas are much appreciated.

244

asked Apr 05 '12 23:04

ktdrv

1 Answer

You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be enormous: 1 million x 1 million. A matrix of this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this: the package author states that the only place in the entire source code where "n,n)" is found is in calculating the proximity matrix.
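To make the size argument concrete, here is a minimal sketch, not the asker's actual code (which was never posted). It estimates the memory an n x n matrix of doubles would need, then fits a forest with the proximity matrix switched off. The `proximity_bytes` helper and the simulated `train` data frame are assumptions made up for illustration; `ntree`, `sampsize` and `proximity` are real arguments of `randomForest()`.

```r
# Rough memory estimate for an n x n proximity matrix of doubles (8 bytes each).
# proximity_bytes is a hypothetical helper, not part of any package.
proximity_bytes <- function(n) n^2 * 8

proximity_bytes(1e6) / 1e12   # terabytes needed for 1M rows
proximity_bytes(1e4) / 1e6    # megabytes needed for 10,000 rows

# Simulated stand-in data (assumption): 5 numeric predictors plus a response,
# mirroring the 6-column shape described in the question.
set.seed(1)
n <- 2000
train <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
train$y <- train$V1 + rnorm(n)

# Only run the fit if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(
    y ~ ., data = train,
    ntree = 50,          # small forest, just for the sketch
    sampsize = 500,      # rows drawn for each tree
    proximity = FALSE    # do not build the n x n proximity matrix
  )
  print(fit)
}
```

If the original call passed `proximity = TRUE`, dropping that argument (or setting it to `FALSE` as above) should avoid the n x n allocation. Whether the full 1M-row fit then fits in memory still depends on the machine.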

But it's hard to help more, given that you've provided no details about the actual code you're using.

162

answered Oct 16 '22 16:10

joran

Tags:

r

machine-learning

random-forest
