
Random Forest: Running out of memory

I'm using scikit-learn's Random Forest to fit training data (~30 MB), and my laptop keeps crashing because it runs out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air (2 GHz, 8 GB memory).

What are some of the ways to deal with this?

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
print("20 Fold CV Score: ", cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc').mean())
asked Dec 29 '25 08:12 by ananuc


1 Answer

Your best choice is to tune the arguments.

n_jobs=4

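In your code n_jobs is passed to RandomForestClassifier, so the trees of each forest are built by four parallel jobs (it would run four train-test cycles simultaneously only if it were passed to cross_val_score). When the jobs run in separate Python processes, each one gets its own copy of the full dataset, so in the worst case n_jobs=4 uses about four times the memory of n_jobs=1. Try reducing n_jobs to 2 or 1 to save memory.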

cv=20

This splits the data into 20 pieces and runs 20 train-test iterations, each training on 19 of the pieces (95% of the data) and testing on the remaining one. You can quite safely reduce it to 10, although your accuracy estimate might get worse. This won't save much memory, but it will shorten the runtime.
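
The fold arithmetic can be sketched in plain Python. The helper name kfold_summary and the 100,000-row dataset are hypothetical, chosen only to illustrate how cv changes the number of fits and the rows used per fit.

```python
def kfold_summary(n_samples, k):
    """Return (number of fits, training rows per fit, test rows per fit) for k-fold CV.

    Uses the largest test fold, so the training size shown is the smallest one.
    """
    test_rows = -(-n_samples // k)  # ceiling division gives the largest fold
    train_rows = n_samples - test_rows
    return k, train_rows, test_rows


# Hypothetical dataset size, for illustration only.
for k in (20, 10):
    fits, train_rows, test_rows = kfold_summary(100_000, k)
    print(f"cv={k}: {fits} fits, {train_rows} training rows, {test_rows} test rows per fit")
```

Going from cv=20 to cv=10 halves the number of fits, while the training set for each fit shrinks only from 95% to 90% of the data, which is why the memory saving is small.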

n_estimators = 100

Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
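
How much memory the trees themselves take depends on the data, so it may be worth measuring. The following is a rough sketch, assuming synthetic data from make_classification as a stand-in for the real training set and using pickled size as an approximate proxy for the memory a fitted forest occupies.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the asker's dataset).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes = {}
for n_trees in (50, 100):
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=1, random_state=0)
    rf.fit(X, y)
    # Pickled size is only a rough proxy for the in-memory size of the fitted forest.
    sizes[n_trees] = len(pickle.dumps(rf))

for n_trees, n_bytes in sizes.items():
    print(f"n_estimators={n_trees}: about {n_bytes / 1e6:.1f} MB pickled")
```

If the reported sizes are small compared with the memory used by the data, reducing n_estimators mainly buys speed rather than memory.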

To sum up, I'd recommend reducing n_jobs to 2 to save memory, at the cost of roughly doubling the runtime. To compensate, I'd suggest changing cv to 10, which roughly halves the runtime. If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50, which roughly halves the processing time again.
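
Put together, the first step of this recommendation might look like the sketch below. The synthetic data from make_classification is an assumption standing in for X_train_a and y_train, and the imports assume a scikit-learn version that provides sklearn.model_selection.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_a and y_train (an assumption for illustration).
X_train_a, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)

# First step: n_jobs=2 and cv=10. If memory is still a problem, fall back to
# n_jobs=1 and n_estimators=50.
rf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0)
scores = cross_val_score(rf, X_train_a, y_train, cv=10, scoring="roc_auc")
print("10 Fold CV Score:", scores.mean())
```

cross_val_score returns one score per fold, so scores holds ten ROC AUC values here and their mean is the figure to compare against the original 20-fold estimate.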

answered Jan 03 '26 09:01 by Timo