I'm using scikit-learn's Random Forest implementation:

```python
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100,
                                             max_features="auto",
                                             max_depth=10)
```
After calling `rf.fit(...)`, the process's memory usage increases by 80 MB, i.e. 0.8 MB per tree. (I also tried many other settings, with similar results; I used `top` and `psutil` to monitor the memory usage.)
A binary tree of depth 10 should have, at most, 2^11 - 1 = 2047 elements, which can all be stored in one dense array, allowing the programmer to find the parent and children of any given element easily. Each element needs only the index of the feature used in the split and the cut-off value, or 6-16 bytes depending on how economical the programmer is. This translates into 0.01-0.03 MB per tree in my case.
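For illustration, here is a minimal sketch of the dense-array layout I have in mind (my own toy structure, not scikit-learn's actual format):

```python
import numpy as np

MAX_DEPTH = 10
N_NODES = 2 ** (MAX_DEPTH + 1) - 1  # 2047 slots for a complete binary tree

# One packed record per node: int16 feature index + float32 cut-off = 6 bytes.
# The children of node i live at 2*i + 1 and 2*i + 2, its parent at (i - 1) // 2,
# so no pointers need to be stored at all.
tree = np.zeros(N_NODES, dtype=np.dtype([("feature", np.int16),
                                         ("threshold", np.float32)]))

print(tree.nbytes / 2**20)  # ~0.012 MB per tree, matching the estimate above
```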
Why does scikit-learn's implementation use 20-60x as much memory to store a tree of the random forest?
Each decision (non-leaf) node stores the left and right branch integer indices (2 × 8 bytes), the index of the feature used for the split (8 bytes), the float value of the threshold for the decision feature (8 bytes), and the decrease in impurity (8 bytes). Furthermore, leaf nodes store the constant target value predicted by the leaf.
You can have a look at the Cython class definition in the source code (`sklearn/tree/_tree.pyx`) for the details.
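You can also sanity-check the numbers on a fitted forest through the `Tree` object's public array attributes. A rough sketch (the exact set of per-node fields varies between scikit-learn releases, so treat the totals as approximate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=10).fit(X, y)

t = rf.estimators_[0].tree_
per_node_arrays = [t.children_left, t.children_right,   # int64 child indices
                   t.feature, t.threshold, t.impurity,  # split description
                   t.n_node_samples, t.value]           # statistics and outputs
total = sum(a.nbytes for a in per_node_arrays)
print(t.node_count, "nodes,", total / 2**20, "MB")      # several 8-byte fields per node
```

Note that `value` alone holds `node_count × n_outputs × n_classes` doubles, so it can dominate for problems with many classes, and none of this counts the Python object overhead of the 100 `DecisionTreeClassifier` wrappers themselves.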