I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I'm using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I'm open to converting the arrays to another type before saving as long as the data are retained exactly.
By explicitly declaring the "ndarray" data type, your array processing can be 1250x faster. This tutorial will show you how to speed up the processing of NumPy arrays using Cython. By explicitly specifying the data types of variables in Python, Cython can give drastic speed increases at runtime.
You can save your NumPy arrays to CSV files using the savetxt() function. This function takes a filename and array as arguments and saves the array into CSV format. You must also specify the delimiter; this is the character used to separate each variable in the file, most commonly a comma.
For really big arrays, I've heard about several solutions, and they mostly rely on being lazy with the I/O:

- NumPy's memmap, which maps a big array to a binary file on disk and behaves like an ndarray (any class accepting an ndarray also accepts a memmap)
- Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py
- Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)
From the docs of numpy.memmap:

Create a memory-map to an array stored in a binary file on disk. Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.

The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
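A minimal sketch of that approach for arrays like the ones in the question (the file name, dtype, and shape are assumptions for illustration; a raw memmap file has no header, so you must supply dtype and shape yourself when reopening):

import numpy as np

shape = (1000, 10_000)                      # stand-in for the real (1e3, 1e6) shape
a = np.random.rand(*shape)

# Write: create a file-backed array and copy the data into it.
mm = np.memmap("array.dat", dtype="float64", mode="w+", shape=shape)
mm[:] = a
mm.flush()                                  # push the data to disk
del mm

# Read: data is paged in lazily; only the slices you touch are read from disk.
mm = np.memmap("array.dat", dtype="float64", mode="r", shape=shape)
print(np.array_equal(a[:10], mm[:10]))      # True: the round trip is exact

If you prefer to keep the standard .npy header (so dtype and shape are stored in the file), numpy.lib.format.open_memmap or np.load(..., mmap_mode="r") give the same lazy, on-demand reads.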
From the h5py docs:

Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

The format supports compressing the data in various ways (more data per byte read from disk), which makes the data less easy to query individually, but in your case (purely loading and dumping whole arrays) it can be efficient.
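Along the same lines, a minimal h5py sketch (the file and dataset names are just examples; compression is optional and unlikely to help with incompressible random floats):

import numpy as np
import h5py

a = np.random.rand(1000, 10_000)            # illustrative size

# Write: store the array as a dataset; pass e.g. compression="gzip" to compress.
with h5py.File("array.h5", "w") as f:
    f.create_dataset("data", data=a)

# Read: slice lazily (f["data"][:100]) or load everything at once with [()].
with h5py.File("array.h5", "r") as f:
    b = f["data"][()]

print(np.array_equal(a, b))                 # True: the round trip is exact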
I've compared a few methods using perfplot (one of my projects). Here are the results:
For writing, all methods are about equally fast for large arrays. The file sizes are also roughly equal, which is to be expected since the input arrays are random doubles and hence hardly compressible.
Code to reproduce the plot:
import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def npy_write(data):
    numpy.save("npy.npy", data)


def hdf5_write(data):
    f = h5py.File("hdf5.h5", "w")
    f.create_dataset("data", data=data)


def pickle_write(data):
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)


def pytables_write(data):
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()


def zarr_write(data):
    zarr.save("out.zarr", data)


perfplot.save(
    "write.png",
    setup=numpy.random.rand,
    kernels=[npy_write, hdf5_write, pickle_write, pytables_write, zarr_write],
    n_range=[2 ** k for k in range(28)],
    xlabel="len(data)",
    equality_check=None,
)
For reading, pickle, PyTables, and HDF5 are roughly equally fast, though pickle and zarr fall behind for large arrays.
Code to reproduce the plot:
import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def setup(n):
    data = numpy.random.rand(n)
    # write all files
    #
    numpy.save("out.npy", data)
    #
    f = h5py.File("out.h5", "w")
    f.create_dataset("data", data=data)
    f.close()
    #
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)
    #
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()
    #
    zarr.save("out.zip", data)


def npy_read(data):
    return numpy.load("out.npy")


def hdf5_read(data):
    f = h5py.File("out.h5", "r")
    out = f["data"][()]
    f.close()
    return out


def pickle_read(data):
    with open("test.pkl", "rb") as f:
        out = pickle.load(f)
    return out


def pytables_read(data):
    f = tables.open_file("pytables.h5", mode="r")
    out = f.root.columns.data[()]
    f.close()
    return out


def zarr_read(data):
    return zarr.load("out.zip")


b = perfplot.bench(
    setup=setup,
    kernels=[
        npy_read,
        hdf5_read,
        pickle_read,
        pytables_read,
        zarr_read,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="len(data)",
)
b.save("out2.png")
b.show()