
Scikit Learn RandomForest Memory Error

I am trying to run the scikit-learn random forest algorithm on the MNIST handwritten digits dataset. During training, the process fails with a MemoryError. What should I do to fix this?

CPU Statistics: Intel Core 2 Duo with 4GB RAM

The shape of the dataset is (60000, 784). The complete error, as shown in the Linux terminal, is as follows:

>   File "./reducer.py", line 53, in <module>
>     main()
>   File "./reducer.py", line 38, in main
>     clf = clf.fit(data,labels) #training the algorithm
>   File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 202, in fit
>     for i in xrange(n_jobs))
>   File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 409, in __call__
>     self.dispatch(function, args, kwargs)
>   File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 295, in dispatch
>     job = ImmediateApply(func, args, kwargs)
>   File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 101, in __init__
>     self.results = func(*args, **kwargs)
>   File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 73, in _parallel_build_trees
>     sample_mask=sample_mask, X_argsorted=X_argsorted)
>   File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 476, in fit
>     X_argsorted=X_argsorted)
>   File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 357, in _build_tree
>     np.argsort(X.T, axis=1).astype(np.int32).T)
>   File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 680, in argsort
>     return argsort(axis, kind, order)
> MemoryError
impiyush asked Apr 16 '14 19:04


3 Answers

Either set n_jobs=1 or upgrade to the bleeding edge version of scikit-learn. The problem is that the currently released version uses multiple processes to fit trees in parallel, which means that the data (X and y) need to be copied to these processes. The next release will use threads instead of processes, so the tree learners share memory.
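
To see why copies of the data matter on a 4GB machine, here is a rough back-of-the-envelope estimate of the arrays alive at the np.argsort step from the traceback. It is a sketch, not a measurement: it assumes X is stored as float64, that np.argsort returns 64-bit indices (a 64-bit platform), and that every worker process holds its own copy, as described above. The helper names array_bytes and to_mib are made up for this illustration.

```python
def array_bytes(n_rows, n_cols, itemsize):
    """Bytes needed for a dense n_rows x n_cols array with the given item size."""
    return n_rows * n_cols * itemsize


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / float(2 ** 20)


# Dataset shape taken from the question.
N_SAMPLES, N_FEATURES = 60000, 784

# Assumption: X is float64 (8 bytes per value).
x_bytes = array_bytes(N_SAMPLES, N_FEATURES, 8)
# Assumption: np.argsort yields 64-bit indices, same shape as X.
argsort_bytes = array_bytes(N_SAMPLES, N_FEATURES, 8)
# The .astype(np.int32) call creates one more copy at 4 bytes per index.
int32_bytes = array_bytes(N_SAMPLES, N_FEATURES, 4)

# Arrays alive at the same time in a single process at that step.
per_process = x_bytes + argsort_bytes + int32_bytes

for n_jobs in (1, 2):
    print("n_jobs=%d: roughly %.0f MiB" % (n_jobs, to_mib(per_process * n_jobs)))
```

The estimate ignores the trees themselves, the interpreter, and whatever else is running, so the real footprint is higher. It only illustrates how quickly per-process copies add up.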

Fred Foo answered Oct 20 '22 12:10


The scikit-learn dev team has greatly improved both memory management and performance of the .ensemble methods

With all due respect to other opinions, scikit-learn 0.16.1 does not appear to suffer from the "nasty" X, y replicas reported for some earlier versions.

For unrelated reasons, I have spent quite a long time exploring the RandomForestRegressor() hyperparameter landscape, including its memory footprint problems.

As of 0.16.1, raising n_jobs from the default n_jobs = 1 to { 2, 3, ... } increased the parallel joblib memory requirements by less than 2%.

One of the core developers behind recent scikit-learn releases, @glouppe, posted an excellent and insightful presentation (Aug 2014, release 0.15.0), including comparisons with R-based and other well-known RandomForest frameworks.

IMHO, pages 25+ cover techniques that increase speed, including np.asfortranarray(...). However, these seem to me (without any experimental proof) to be internal guidelines shared within the scikit-learn development team rather than a recommendation for us mortals who live in the "outer world".

Regression or Classification?

Yes, that matters. Some additional feature-engineering effort and testing may be warranted if you are not bagging over the full feature-set vector. Your learner appears to be the classifier case, so dig deeper into:

  1. experiment with non-default settings for max_features et al.
  2. use O/S services to provide more virtual memory (mkswap + swapon), if still needed after tuning the learner in step 1.
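
As an illustration of step 1, here is a minimal sketch of a classifier with non-default settings that limit how much work and memory each tree takes. The data is a small synthetic stand-in with the same number of features as MNIST, not the real dataset, and the specific values (max_features=0.05, max_depth=12, min_samples_leaf=5) are arbitrary starting points chosen for the example, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for MNIST: 500 samples, 784 features, 10 classes.
# float32 takes half the memory of float64 for the same shape.
X = rng.randint(0, 256, size=(500, 784)).astype(np.float32)
y = rng.randint(0, 10, size=500)

clf = RandomForestClassifier(
    n_estimators=20,
    max_features=0.05,    # consider about 5% of the features at each split
    max_depth=12,         # cap tree depth so individual trees stay small
    min_samples_leaf=5,   # avoid growing leaves for tiny groups of samples
    n_jobs=1,             # single job, as suggested in the first answer
    random_state=0,
)
clf.fit(X, y)

# Inspect how large the fitted trees actually became.
node_counts = [est.tree_.node_count for est in clf.estimators_]
print("largest tree has %d nodes" % max(node_counts))
```

Whether such limits cost accuracy depends on the data, so they are worth checking with cross-validation before reaching for swap space.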

Addendum

After another round of testing, one interesting observation emerged.

While a .set_params( n_jobs = -1 ).fit( X, y ) configuration trained the RandomForestRegressor() successfully, an ugly surprise came later, when calling .predict( X_observed ) on the pre-trained object.

There, a similar map/reduce-bound memory issue was reported (now with 0.17.0).

Nevertheless, the same call run as a single job, .set_params( n_jobs = 1 ).predict( X_observed ), was served well by .predict()
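
The pattern described in this addendum can be sketched as follows: fit in parallel, then switch the same estimator back to a single job before predicting. The data here is small and synthetic, so it will not reproduce the memory issue itself. It only shows the sequence of calls, with n_jobs=2 standing in for the n_jobs=-1 used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Small synthetic regression problem (illustrative only).
X = rng.rand(300, 20)
y = 2.0 * X[:, 0] + 0.1 * rng.rand(300)
X_observed = rng.rand(50, 20)

model = RandomForestRegressor(n_estimators=30, random_state=0)

# Train with parallel jobs.
model.set_params(n_jobs=2).fit(X, y)

# Switch to a single job for prediction; set_params returns the estimator,
# so the calls can be chained as in the text above.
predictions = model.set_params(n_jobs=1).predict(X_observed)
print(predictions.shape)
```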

user3666197 answered Oct 20 '22 11:10


One solution is to use a more recent version (0.19) of scikit-learn. The change log mentions this in the bug fixes section (and it is indeed a major improvement):

 Fixed excessive memory usage in prediction for random forests estimators. #8672 by Mike Benfield.

You can install this version by using:

pip3 install scikit-learn==0.19.0

Sanchit answered Oct 20 '22 12:10