
"No space left on device" error while fitting Sklearn model

I'm fitting an LDA model on a large dataset using scikit-learn. The relevant piece of code looks like this:

lda = LatentDirichletAllocation(n_topics = n_topics, 
                                max_iter = iters,
                                learning_method = 'online',
                                learning_offset = offset,
                                random_state = 0,
                                evaluate_every = 5,
                                n_jobs = 3,
                                verbose = 0)
lda.fit(X)

(I guess the only possibly relevant detail here is that I'm using multiple jobs.)

After some time I get a "No space left on device" error, even though there is plenty of space on the disk and plenty of free memory. I ran the same code several times on two different computers (my local machine and a remote server), first with python3 and then with python2, and each time I ended up with the same error.

If I run the same code on a smaller sample of the data, everything works fine.

The entire stack trace:

Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 271, in save
    obj, filename = self._write_array(obj, filename)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 231, in _write_array
    self.np.save(filename, array)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 491, in save
    pickle_kwargs=pickle_kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/format.py", line 584, in write_array
    array.tofile(fp)
IOError: 275500 requested and 210934 written


IOErrorTraceback (most recent call last)
<ipython-input-7-6af7e7c9845f> in <module>()
      7                                 n_jobs = 3,
      8                                 verbose = 0)
----> 9 lda.fit(X)

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in fit(self, X, y)
    509                     for idx_slice in gen_batches(n_samples, batch_size):
    510                         self._em_step(X[idx_slice, :], total_samples=n_samples,
--> 511                                       batch_update=False, parallel=parallel)
    512                 else:
    513                     # batch update

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _em_step(self, X, total_samples, batch_update, parallel)
    403         # E-step
    404         _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 405                                      parallel=parallel)
    406 
    407         # M-step

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _e_step(self, X, cal_sstats, random_init, parallel)
    356                                               self.mean_change_tol, cal_sstats,
    357                                               random_state)
--> 358             for idx_slice in gen_even_slices(X.shape[0], n_jobs))
    359 
    360         # merge result

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    808                 # consumption.
    809                 self._iterating = False
--> 810             self.retrieve()
    811             # Make sure that we get a last message telling us we are done
    812             elapsed_time = time.time() - self._start_time

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    725                 job = self._jobs.pop(0)
    726             try:
--> 727                 self._output.extend(job.get())
    728             except tuple(self.exceptions) as exception:
    729                 # Stop dispatching any new job in the async callback thread

/home/ubuntu/anaconda2/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    565             return self._value
    566         else:
--> 567             raise self._value
    568 
    569     def _set(self, i, obj):

IOError: [Errno 28] No space left on device
machaerus asked Oct 18 '16 18:10


3 Answers

I had the same problem with LatentDirichletAllocation. It seems that you are running out of shared memory (look at the /dev/shm row in the output of df -h). Try setting the JOBLIB_TEMP_FOLDER environment variable to a different location, e.g. /tmp. In my case this solved the problem.

Alternatively, increase the size of the shared memory, if you have the appropriate rights on the machine you are training the LDA model on.
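
To check from Python whether /dev/shm really is the bottleneck, something like the following sketch may help. It only uses the standard library (shutil.disk_usage, which needs Python 3.3+); the function name free_space_report and the choice of paths are illustrative assumptions, not part of scikit-learn or joblib.

```python
import os
import shutil


def free_space_report(paths):
    """Map each path to (total_bytes, free_bytes), or None if it does not exist."""
    report = {}
    for path in paths:
        if os.path.exists(path):
            usage = shutil.disk_usage(path)
            report[path] = (usage.total, usage.free)
        else:
            # e.g. /dev/shm is absent on some systems
            report[path] = None
    return report


# Compare the shared-memory mount with a regular temp directory.
for path, usage in free_space_report(['/dev/shm', '/tmp']).items():
    if usage is None:
        print('{}: not present'.format(path))
    else:
        total, free = usage
        print('{}: {:.1f} MiB free of {:.1f} MiB'.format(
            path, free / 2.0**20, total / 2.0**20))
```

If the free figure for /dev/shm is small compared with the size of the arrays being passed to the worker processes, pointing JOBLIB_TEMP_FOLDER at a larger location is the likely fix.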

silentser answered Oct 24 '22 05:10


This problem occurs when the shared memory is exhausted, so further writes to it fail. It is a frustrating problem that many Kaggle users run into while fitting machine learning models.

I overcame this problem by setting the JOBLIB_TEMP_FOLDER variable with the following line in the notebook (%env is an IPython/Jupyter magic command):

%env JOBLIB_TEMP_FOLDER=/tmp
abhinav answered Oct 24 '22 05:10


The solution from @silentser solved the problem for me.

If you want to set the environment variable in code, do this:

import os
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
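
If /tmp might itself be small on some machines, a slightly more defensive variant is to pick the first candidate directory that has enough free space. This is only a sketch: pick_joblib_temp_folder, the candidate list and the 1 GiB threshold are hypothetical choices, and shutil.disk_usage needs Python 3.3+. As far as I know the variable has to be set before the parallel fit starts; setting it before importing sklearn is the safest option.

```python
import os
import shutil
import tempfile


def pick_joblib_temp_folder(candidates, min_free_bytes):
    """Return the first existing directory with at least min_free_bytes free."""
    for path in candidates:
        if os.path.isdir(path) and shutil.disk_usage(path).free >= min_free_bytes:
            return path
    return None


# Hypothetical candidates and threshold; adjust to the size of your data.
folder = pick_joblib_temp_folder(['/tmp', tempfile.gettempdir()],
                                 min_free_bytes=2**30)
if folder is not None:
    os.environ['JOBLIB_TEMP_FOLDER'] = folder
else:
    print('No candidate directory has enough free space')
```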
Minions answered Oct 24 '22 03:10