
Large data serialization on scikit-learn with Python 3

I have a MacBook (Mac OS X 10.9) with 16 GB of RAM and two Pythons installed via Anaconda: 2.7.8 and 3.4.1. Both have the latest scikit-learn, 0.15.1. I'm trying to run this simple code, which just tests whether large matrices can be serialized:

import numpy as np
test_data = np.random.rand(10000, 60000)
print(test_data.nbytes / 2**30)
from sklearn.externals import joblib
joblib.dump(test_data, '/Users/va/Desktop/test_data.joblib')

Python 2.7.8 handles this fine, but Python 3.4.1 fails with the following error:

Failed to save <class 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/site-packages/numpy/lib/npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/site-packages/numpy/lib/format.py", line 410, in write_array
    fp.write(array.tostring('C'))
OSError: [Errno 22] Invalid argument

Traceback (most recent call last):

  File "<ipython-input-3-90ed09e5c6d4>", line 1, in <module>
    joblib.dump(test_data, '/Users/va/Desktop/test_data.joblib')

  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 368, in dump
    pickler.dump(value)

  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/pickle.py", line 412, in dump
    self.framer.end_framing()

  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/pickle.py", line 196, in end_framing
    self.commit_frame(force=True)

  File "/Users/va/anaconda/python.app/Contents/lib/python3.4/pickle.py", line 208, in commit_frame
    write(data)

OSError: [Errno 22] Invalid argument

The problem appears to be the amount of data being stored. For example, Python 3 handles np.random.rand(10000, 20000), which is about 1.5 GB, without any trouble, whereas the failing array above is about 4.5 GB.
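To make the size comparison concrete, here is a small sketch that computes the in-memory size of both float64 arrays without allocating them. The threshold `ASSUMED_WRITE_LIMIT` is an assumption (a signed 32-bit byte count for a single write), not something the tracebacks above confirm; it is simply a limit that would be consistent with one array succeeding and the other failing.

```python
BYTES_PER_FLOAT64 = 8
GIB = 2**30

# Assumption: a single write is limited to a signed 32-bit byte count.
ASSUMED_WRITE_LIMIT = 2**31 - 1


def array_nbytes(rows, cols, itemsize=BYTES_PER_FLOAT64):
    """Size in bytes of a dense rows x cols array with the given item size."""
    return rows * cols * itemsize


for shape in [(10000, 20000), (10000, 60000)]:
    size = array_nbytes(*shape)
    status = "over" if size > ASSUMED_WRITE_LIMIT else "under"
    print(shape, round(size / GIB, 2), "GiB,", status, "the assumed limit")
```

Under that assumption, the 10000 x 20000 array fits in one write and the 10000 x 60000 array does not.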

For completeness, plain pickle does not work either:

import pickle
with open('/Users/va/Desktop/test_data.pkl', 'wb') as f:
    pickle.dump(test_data, f, protocol=pickle.HIGHEST_PROTOCOL)

which fails with:

Traceback (most recent call last):

  File "<ipython-input-6-3f73f3011539>", line 3, in <module>
    pickle.dump(test_data, f, protocol=pickle.HIGHEST_PROTOCOL)

OSError: [Errno 22] Invalid argument

On Windows 7, Python 3.4 works fine with both joblib and pickle.

Any suggestions on how to solve this problem with Python 3 on a Mac?

night_bat asked Aug 14 '14 07:08


1 Answer

This happens to me too, on OS X 10.10 with Python 3.4.3 using pickle.

I switched to https://github.com/zopefoundation/zodbpickle instead. It is around 2-3 times slower, but it definitely works with sklearn classifiers.
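A different workaround that stays within the standard library, and is not part of the approach described above, is to split each large write into smaller pieces. The sketch below is a minimal illustration under the assumption that the failure is caused by a single write call exceeding roughly 2 GiB; `ChunkedWriter`, `dump_chunked`, and `MAX_CHUNK` are hypothetical names introduced here, and the sketch covers writing only, not reading the file back.

```python
import io
import pickle

# Assumption: single writes must stay below a signed 32-bit byte count.
MAX_CHUNK = 2**31 - 1


class ChunkedWriter:
    """Hypothetical file-like wrapper that splits large writes into chunks."""

    def __init__(self, raw, chunk_size=MAX_CHUNK):
        self.raw = raw
        self.chunk_size = chunk_size

    def write(self, data):
        # View the payload as raw bytes so that slicing does not copy it.
        view = memoryview(data).cast("B")
        for start in range(0, len(view), self.chunk_size):
            self.raw.write(view[start:start + self.chunk_size])
        return len(view)


def dump_chunked(obj, path, chunk_size=MAX_CHUNK):
    """Pickle obj to path, never passing more than chunk_size bytes per write."""
    with open(path, "wb") as f:
        pickle.dump(obj, ChunkedWriter(f, chunk_size),
                    protocol=pickle.HIGHEST_PROTOCOL)


# Small in-memory demonstration with a deliberately tiny chunk size.
demo_buffer = io.BytesIO()
demo_payload = {"weights": list(range(1000)), "label": "demo"}
pickle.dump(demo_payload, ChunkedWriter(demo_buffer, chunk_size=64),
            protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(demo_buffer.getvalue())
```

In the question's setting this would be called as `dump_chunked(test_data, '/Users/va/Desktop/test_data.pkl')`. Whether it resolves the error on a given OS X and Python combination would need to be checked on that machine.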

Adel Nizamutdinov answered Nov 03 '22 12:11