My code runs fine with smaller test samples, such as 10000 rows of data in X_train, y_train. When I call it for millions of rows, I get the error shown below. Is the bug in a package, or can I do something differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I put the pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's external package on my Dropbox for you.
The test script is:
import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp

def main():
    print("Started.")
    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)
    n_samples = 1000000
    n_features = 1000
    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)
    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))
    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha': 10.0 ** -np.arange(1, 7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)

if __name__ == '__main__':
    mp.freeze_support()
    main()
This results in the output:
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written
Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code
EDIT: ogrisel's answer does work, using manual memory mapping with scikit-learn 0.15.0b1. Remember to run only one script at a time; otherwise you can still run out of memory and end up with too many threads. (My run took ~60 GB on data of size ~12.5 GB in CSV, with 8 threads.)
The delayed function is a simple trick for creating a tuple (function, args, kwargs) with function-call syntax. Under Windows, using multiprocessing.Pool with joblib requires protecting the main entry point of the script (with an if __name__ == '__main__': guard) to avoid recursive spawning of subprocesses.
multiprocessing.freeze_support(): this function allows a frozen program to create and start new processes via the multiprocessing.Process class when the program has been frozen for distribution on Windows. If the function is called and the program is not frozen for distribution, it has no effect.
multiprocessing is a package that supports spawning processes using an API similar to the threading module. It offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.
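To make the (function, args, kwargs) idea concrete, here is a minimal sketch using only the standard library. The `delayed` helper below is a hypothetical stand-in written for illustration, not joblib's own source, and `scale` and `main` are made-up names. The tasks are run sequentially here; a process pool would instead ship each tuple to a worker.

```python
import functools


def delayed(function):
    """Hypothetical stand-in: capture a call as a tuple instead of running it."""
    @functools.wraps(function)
    def capture(*args, **kwargs):
        return function, args, kwargs
    return capture


def scale(x, factor=1):
    # Toy workload used only for this illustration.
    return x * factor


def main():
    # Function-call syntax, but nothing is executed yet: each item is a
    # (function, args, kwargs) tuple describing one unit of work.
    tasks = [delayed(scale)(i, factor=10) for i in range(3)]
    # Run the captured calls one after another.
    return [func(*args, **kwargs) for func, args, kwargs in tasks]


# The guard keeps the work from being re-run if the module is imported,
# which is what happens in child processes on Windows.
if __name__ == "__main__":
    print(main())
```

Because each task is plain data, it can be pickled and sent to another process, which is why the entry point needs the guard on Windows.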
As a workaround you can try to memory map your data explicitly & manually as explained in the joblib documentation.
Edit #1: Here is the important part:
from sklearn.externals import joblib
joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')
Then pass this memmap'ed data to GridSearchCV under scikit-learn 0.15+.
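A small, self-contained version of the dump-and-reload step might look like the sketch below. It assumes the standalone joblib package (recent scikit-learn releases no longer ship sklearn.externals.joblib), a deliberately tiny array in place of the real data, and a hypothetical temporary file name.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib rather than sklearn.externals.joblib
import numpy as np

# Tiny stand-in for the real training matrix (sizes chosen only for the demo).
X_train = np.random.randn(1000, 20)
expected = X_train.copy()

# Hypothetical scratch location; any writable path works.
tmp_dir = tempfile.mkdtemp()
some_filename = os.path.join(tmp_dir, "X_train.joblib")

# Write the array to disk once, uncompressed, so it can be memory mapped.
joblib.dump(X_train, some_filename)

# Reload it as a memory map: the data stays on disk and is paged in on demand.
X_train = joblib.load(some_filename, mmap_mode="r+")
print(type(X_train).__name__, X_train.shape)
```

The reloaded object behaves like a regular array, so it can be passed to the fit call in place of the in-memory one.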
Edit #2: Furthermore, if you use the 32-bit version of Anaconda, you will be limited to 2 GB per Python process, which can also limit the memory.
I just found a bug in numpy.save under Python 3.4, but even when that is fixed, the subsequent call to mmap will fail with:
OSError: [WinError 8] Not enough storage is available to process this command
So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).
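The numbers in the question's log line up with a 2 GiB boundary, which the following arithmetic sketch illustrates. It only uses figures printed in the output above. The tracebacks do not show which call hit the limit, so treating the 8 GB payload as the value that overflowed the 'i' format is an assumption.

```python
import struct

ITEMSIZE = 8  # bytes per float64 element

# Figures taken from "ValueError: 1000000000 requested and 268435456 written".
requested_items = 1000000000
written_items = 268435456

# The partial write stopped after exactly 2**31 bytes (2 GiB).
written_bytes = written_items * ITEMSIZE
print(written_bytes, written_bytes == 2 ** 31)

# The full array is 8e9 bytes, matching "input data size: 8000.000MB".
total_bytes = requested_items * ITEMSIZE

# The struct 'i' format is a 32-bit signed integer, so it cannot hold
# a value that large; packing it raises struct.error.
INT32_MAX = 2 ** 31 - 1
try:
    struct.pack("<i", total_bytes)
    fits_in_i = True
except struct.error:
    fits_in_i = False
print(total_bytes > INT32_MAX, fits_in_i)
```

Both failures are therefore consistent with a 32-bit size limit somewhere along the path, even on a 64-bit interpreter.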
Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently joblib.Parallel memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
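For readers unfamiliar with the three modes, the sketch below shows how they differ, using plain numpy.load on a tiny file rather than joblib.Parallel itself. The file name is hypothetical, and the example says nothing about paging-file behaviour; it only demonstrates where writes end up.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file holding four zeros.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.npy")
np.save(path, np.zeros(4))

# 'r': read-only mapping; attempts to write are rejected.
ro = np.load(path, mmap_mode="r")
try:
    ro[0] = 1.0
    read_only_rejected = False
except ValueError:
    read_only_rejected = True
del ro

# 'c': copy-on-write; changes live in private memory and never reach the file.
cow = np.load(path, mmap_mode="c")
cow[0] = 2.0
del cow
after_cow = float(np.load(path)[0])

# 'r+': read-write; changes are written back to the file.
rw = np.load(path, mmap_mode="r+")
rw[0] = 3.0
rw.flush()
del rw
after_rw = float(np.load(path)[0])

print(read_only_rejected, after_cow, after_rw)
```

With 'c', every modified page needs private backing storage, which is one plausible reason the copy-on-write mode puts more pressure on the paging file than 'r' or 'r+'.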