scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

Tags:

My code runs fine with smaller test samples, like 10000 rows of data in X_train, y_train. When I call it for millions of rows, I get the resulting error. Is the bug in a package, or can I do something differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I put the pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's external package on my Dropbox for you.

The test script is:

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")

    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000

    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)

    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha' : 10.0 ** -np.arange(1,7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)

if __name__=='__main__':
    mp.freeze_support()
    main()

This results in the output:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written

Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code

EDIT: ogrisel's answer does work with manual memory mapping with scikit-learn-0.15.0b1. Don't forget to run only one script at once, otherwise you can still run out of memory and have too many threads. (My run take ~60 GB on data of size ~12.5 GB in CSV, with 8 threads.)

747

asked Jun 25 '14 11:06

László

1 Answers

As a workaround you can try to memory map your data explicitly & manually as explained in the joblib documentation.

Edit #1: Here is the important part:

from sklearn.externals import joblib

joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmap'ed data to GridSearchCV under scikit-learn 0.15+.

Edit #2: Furthermore: if you use the 32bit version of Anaconda, you will be limited to 2GB per python process which can also limit the memory.

I just found a bug for numpy.save under Python 3.4 but even when fixed the subsequent call to mmap will fail with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64 bit version of Python (with Anaconda as AFAIK there is currently no other 64bit packages for numpy / scipy / scikit-learn==0.15.0b1 at this time).

Edit #3: I found another issue that might be causing excessive memory usage under windows: currently joblib.Parallel memory maps input data with mmap_mode='c' by default: this copy-on-write setting seems to cause windows to exhaust the paging file and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.

163

answered Nov 13 '22 13:11

ogrisel

Related questions
                            
                                How to align multiple camera images using opencv
                            
                                Multiprocessing objects with namedtuple - Pickling Error
                            
                                Map string values in a Pandas Dataframe with integers
                            
                                Developing Linux Kernel Modules in Python
                            
                                Running a py file from cmd that contain matplotlib plot
                            
                                Logging to both thread-specific and shared log files in Python
                            
                                Pandas group by time windows
                            
                                How to simplify boundary geometries in shapely
                            
                                Pythonic way: Utility functions in class or module [closed]
                            
                                indexing and finding values in list of namedtuples
                            
                                Pandas: Count Unique Values after Resample
                            
                                Proper use of QThread.currentThreadId()
                            
                                Base58Check encoding for Bitcoin addresses too long
                            
                                Loading data from Yahoo! Finance with pandas
                            
                                python argparse help message, disable metavar for short options?
                            
                                python exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in
                            
                                psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0
                            
                                Does pandas/scipy/numpy provide a cumulative standard deviation function?
                            
                                Is there a method to do arithmetic with SciPy's random variables?
                            
                                Python: Interpretation on XCORR

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

Tags:

python

multiprocessing

numpy

anaconda

scikit-learn

László

People also ask

1 Answers

ogrisel

Recent Activity

Donate For Us