 

Why is scikit-learn's random forest using so much memory?

I'm using scikit's Random Forest implementation:

sklearn.ensemble.RandomForestClassifier(n_estimators=100, 
                                        max_features="auto", 
                                        max_depth=10)

After calling rf.fit(...), the process's memory usage increases by 80 MB, i.e. 0.8 MB per tree. (I also tried many other settings, with similar results. I used top and psutil to monitor the memory usage.)

A binary tree of depth 10 should have, at most, 2^11-1 = 2047 elements, which can all be stored in one dense array, allowing the programmer to find parents and children of any given element easily.

Each element needs an index of the feature used in the split and the cut-off, or 6-16 bytes, depending on how economical the programmer is. This translates into 0.01-0.03MB per tree in my case.
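The estimate above is easy to check with back-of-envelope arithmetic (pure Python, no assumptions beyond the numbers already stated):

```python
# A full binary tree of depth 10 has 2**11 - 1 = 2047 nodes.
n_nodes = 2**11 - 1

# 6-16 bytes per node (feature index + cut-off), as assumed above.
low_mb = n_nodes * 6 / 1e6    # ~0.012 MB
high_mb = n_nodes * 16 / 1e6  # ~0.033 MB
print(f"{low_mb:.3f}-{high_mb:.3f} MB per tree")  # 0.012-0.033 MB
```

That is where the quoted 0.01-0.03 MB per tree comes from, versus the observed 0.8 MB.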

Why is scikit's implementation using 20-60x as much memory to store a tree of a random forest?

asked Dec 06 '13 by MWB

People also ask

How can I speed up random forest?

If you wish to speed up your random forest, lower the number of estimators. If you want to increase the accuracy of your model, increase the number of trees. Specify the maximum number of features to be included at each node split. This depends very heavily on your dataset.
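A minimal sketch of those knobs in practice (the dataset here is synthetic and the specific parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,      # fewer trees -> faster fit, less memory
    max_features="sqrt",  # features considered at each node split
    max_depth=10,         # cap tree size
    n_jobs=-1,            # train trees on all CPU cores in parallel
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```

Raising n_estimators back up trades training time (and memory) for accuracy; n_jobs=-1 parallelizes across trees, since each tree is trained independently.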

Is random forest faster on GPU?

We trained a random forest model using 300 million instances: Spark took 37 minutes on a 20-node CPU cluster, whereas RAPIDS took 1 second on a 20-node GPU cluster. That's over 2000x faster with GPUs 🤯! Warp speed random forest with GPUs and RAPIDS!

Why are random forests prone to overfitting?

A Random Forest with only one tree will overfit the data, because it is the same as a single decision tree. As trees are added to the Random Forest, the tendency to overfit decreases (thanks to bagging and random feature selection).

How much data does it take to train a random forest?

For testing, 10 is enough but to achieve robust results, you can increase it up to 100 or 500. This however only makes sense if you have more than 8 input rasters, otherwise the training data is always the same, even if you repeat it 1000 times.


1 Answer

Each decision (non-leaf) node stores the left and right branch integer indices (2 x 8 bytes), the index of the feature used to split (8 bytes), the float value of the threshold for the decision feature (8 bytes), and the decrease in impurity (8 bytes). Furthermore, leaf nodes store the constant target value predicted by the leaf.

You can have a look at the Cython class definition in the source code for the details.
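Totalling just the fields listed above gives a rough lower bound per tree (the exact node layout varies across scikit-learn versions, and leaf value arrays add more on top, so this is an estimate, not the actual struct size):

```python
# Fields from the answer, at 8 bytes each:
# 2 child indices + feature index + threshold + impurity decrease.
bytes_per_node = 8 * (2 + 1 + 1 + 1)   # 40 bytes

n_nodes = 2**11 - 1                    # full binary tree of depth 10
per_tree_mb = n_nodes * bytes_per_node / 1e6
print(f"~{per_tree_mb:.2f} MB per tree")  # ~0.08 MB
```

That already triples the asker's 6-16 byte estimate; per-node value arrays and array over-allocation can add further overhead, and the Cython source linked above remains the authoritative reference.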

answered Sep 23 '22 by ogrisel