I am trying to run the scikit-learn random forest algorithm on the MNIST handwritten digits dataset. During training, the system runs into a MemoryError. What should I do to fix this issue?

System specs: Intel Core 2 Duo with 4 GB RAM.

The shape of the dataset is (60000, 784). The complete error, as shown on the Linux terminal, is as follows:
> File "./reducer.py", line 53, in <module>
> main() File "./reducer.py", line 38, in main
> clf = clf.fit(data,labels) #training the algorithm File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 202,
> in fit
> for i in xrange(n_jobs)) File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 409, in
> __call__
> self.dispatch(function, args, kwargs) File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 295, in
> dispatch
> job = ImmediateApply(func, args, kwargs) File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 101, in
> __init__
> self.results = func(*args, **kwargs) File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 73, in
> _parallel_build_trees
> sample_mask=sample_mask, X_argsorted=X_argsorted) File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 476, in fit
> X_argsorted=X_argsorted) File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 357, in
> _build_tree
> np.argsort(X.T, axis=1).astype(np.int32).T) File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line
> 680, in argsort
> return argsort(axis, kind, order) MemoryError
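For reference, a minimal sketch of the kind of script that triggers this (the `fetch_openml` loader is a stand-in assumption; the actual `reducer.py` builds `data`/`labels` differently):

```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

# Load MNIST as a (70000, 784) array and keep the 60000-sample training split.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
data, labels = X[:60000], y[:60000]

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(data, labels)  # MemoryError is raised inside this call
```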
Either set n_jobs=1 or upgrade to the bleeding-edge version of scikit-learn. The problem is that the currently released version uses multiple processes to fit trees in parallel, which means that the data (X and y) need to be copied to those processes. The next release will use threads instead of processes, so the tree learners share memory.
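For example (a minimal sketch; `n_estimators` is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# A single worker keeps fitting inside one process, so X and y are not
# copied out to joblib child processes.
clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
clf = clf.fit(data, labels)
```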
With all due respect to other opinions, scikit-learn 0.16.1 does not prove itself to have the "nasty" X, y replicas cited for some early versions of the .ensemble methods.
For some other reasons, I have spent rather a long time exploring the RandomForestRegressor() hyperparameter landscape, incl. its memory-footprint problems. As of 0.16.1, there was less than a 2% increase in the parallel-joblib memory requirements when moving from the default n_jobs = 1 to { 2, 3, ... }.
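A sketch of how such a comparison can be made (the third-party memory_profiler package is an assumption of mine, not part of scikit-learn; X, y stand in for the training data):

```python
from memory_profiler import memory_usage
from sklearn.ensemble import RandomForestRegressor

for n_jobs in (1, 2, 3):
    clf = RandomForestRegressor(n_estimators=10, n_jobs=n_jobs)
    # Sample RSS while fit() runs, including any joblib child processes.
    peak = max(memory_usage((clf.fit, (X, y)), include_children=True))
    print("n_jobs=%d -> peak %.0f MiB" % (n_jobs, peak))
```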
A co-father of the recent scikit-learn releases, @glouppe, posted a marvelous & insightful presentation (2014-Aug, rel. 0.15.0), incl. comparisons with R-based and other known RandomForest frameworks.
IMHO, pages 25+ speak about techniques that increase speed, incl. np.asfortranarray(...); however, these seem to me (without any experimental proof) to be internal directions shared inside the scikit-learn development team rather than a recommendation for us, the mortals who live in the "outer world".
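For completeness, the idea itself is a one-liner (a sketch, with no performance claim implied; the float32 cast is my own addition to halve the footprint versus the float64 default):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Column-major (Fortran) order makes per-feature scans hit contiguous memory.
X_f = np.asfortranarray(data, dtype=np.float32)

clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
clf = clf.fit(X_f, labels)
```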
Yes, that matters. Some additional feature-engineering effort & testing might be in place if you are not doing a full-scale feature-set vector bagging. Your learner seems to be the classifier case, so go deeper into max_features et al; a sketch follows below. Use mkswap + swapon to add swap space if it is still needed after tuning the learner as in 1.
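A sketch of that direction (the values shown are illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features bounds how many candidate features each split evaluates;
# "sqrt" means ~28 of the 784 pixels here, which shrinks the per-split
# working set compared with considering all features.
clf = RandomForestClassifier(n_estimators=10, max_features="sqrt", n_jobs=1)
clf = clf.fit(data, labels)
```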
After another round of testing, one interesting observation has appeared.
While a .set_params( n_jobs = -1 ).fit( X, y ) configuration was used successfully for training the RandomForestRegressor(), the ugly surprise came later, when trying to use .predict( X_observed ) on such a pre-trained object: there, a similar map/reduce-bound memory issue was reported (with 0.17.0 now). Nevertheless, the same object served .predict( X_observed ) well as a solo job after .set_params( n_jobs = 1 ).
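In code, that workaround looks like this (a sketch; X, y, X_observed stand in for your arrays):

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10)

# Train with all cores in parallel...
model.set_params(n_jobs=-1).fit(X, y)

# ...then fall back to a single job for prediction, which avoided the
# memory issue described above.
y_hat = model.set_params(n_jobs=1).predict(X_observed)
```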
One solution is to use the most recent version (0.19) of scikit-learn. Its changelog mentions this in the bug-fixes section (indeed, there is a major improvement):

> Fixed excessive memory usage in prediction for random forests estimators. #8672 by Mike Benfield.
You can install this version by using:
pip3 install scikit-learn==0.19.0