This differs from "Write multiple numpy arrays to file" in that I need to be able to stream content rather than write it all at once.

I need to write multiple compressed NumPy arrays in binary to a file. I cannot store all the arrays in memory before writing, so it is more like streaming NumPy arrays to a file.
This currently works fine as text:

f = open("somefile", "w")
while doing_stuff:
    f.writelines(somearray + "\n")  # somearray is a new instance every loop

However, this does not work if I try to write the arrays as binary.
Arrays are created at 30 Hz and grow too big to keep in memory. They also cannot each be stored in their own single-array file, because that would be wasteful and cause a huge mess. So I would like one file per session instead of 10k files per session.
The .npz file format is appropriate for this case: it is a compressed version of the native NumPy file format. The savez_compressed() NumPy function allows multiple NumPy arrays to be saved to a single compressed .npz file.
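As a minimal sketch (the file name 'data.npz' and the keyword names here are just placeholders): savez_compressed takes every array in a single call, so by itself it does not solve the streaming problem, but it shows the target format.

```python
import numpy as np

a = np.arange(10)
b = np.linspace(0.0, 1.0, 5)

# All arrays must already be in memory: they are written in one call,
# with each keyword argument becoming a named array in the archive.
np.savez_compressed('data.npz', first=a, second=b)

loaded = np.load('data.npz')
print(loaded.files)  # the keyword names, e.g. ['first', 'second']
```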
An NPZ file is just a zip archive, so you could save each array to a temporary NPY file, add that NPY file to the zip archive, and then delete the temporary file.
For example,
import os
import zipfile

import numpy as np

# File that will hold all the arrays.
filename = 'foo.npz'

with zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        # `a` is the array to be written to the file in this iteration.
        a = np.random.randint(0, 10, size=20)

        # Name of the temporary file to which `a` is written. The root of this
        # filename is the name that will be assigned to the array in the npz
        # file. I've used 'arr_{}' (e.g. 'arr_0', 'arr_1', ...), similar to
        # how `np.savez` treats positional arguments.
        tmpfilename = "arr_{}.npy".format(i)

        # Save `a` to a npy file.
        np.save(tmpfilename, a)

        # Add the file to the zip archive.
        zf.write(tmpfilename)

        # Delete the npy file.
        os.remove(tmpfilename)
Here's an example where that script is run, and then the data is read back using np.load:
In [1]: !ls
add_array_to_zip.py
In [2]: run add_array_to_zip.py
In [3]: !ls
add_array_to_zip.py foo.npz
In [4]: foo = np.load('foo.npz')
In [5]: foo.files
Out[5]:
['arr_0',
'arr_1',
'arr_2',
'arr_3',
'arr_4',
'arr_5',
'arr_6',
'arr_7',
'arr_8',
'arr_9']
In [6]: foo['arr_0']
Out[6]: array([0, 9, 3, 7, 2, 2, 7, 2, 0, 5, 8, 1, 1, 0, 4, 2, 5, 1, 8, 2])
You'll have to test this on your system to see if it can keep up with your array generation process.
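If the temporary files bother you, a variant of the same idea (a sketch, using the placeholder name 'foo_mem.npz') serializes each array to an in-memory buffer with np.save and hands the bytes straight to ZipFile.writestr:

```python
import io
import zipfile

import numpy as np

with zipfile.ZipFile('foo_mem.npz', mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        a = np.random.randint(0, 10, size=20)

        # Serialize `a` in .npy format to an in-memory buffer.
        buf = io.BytesIO()
        np.save(buf, a)

        # Store the raw .npy bytes under the name 'arr_<i>.npy';
        # only the archive itself ever touches the disk.
        zf.writestr('arr_{}.npy'.format(i), buf.getvalue())
```

The result is loadable with np.load exactly like the temp-file version.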
Another alternative is to use something like HDF5, with either h5py or pytables.
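A sketch with h5py, assuming it is installed (the file name 'session.h5' and the dataset names are arbitrary); each array becomes its own compressed dataset as it arrives, so the arrays never have to coexist in memory:

```python
import h5py
import numpy as np

with h5py.File('session.h5', 'w') as f:
    for i in range(10):
        a = np.random.randint(0, 10, size=20)
        # Each array is written to the file as its own gzip-compressed
        # dataset; nothing accumulates in memory between iterations.
        f.create_dataset('arr_{}'.format(i), data=a, compression='gzip')

with h5py.File('session.h5', 'r') as f:
    print(sorted(f.keys()))
```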
One option might be to use pickle to save the arrays to a file opened in append-binary ('ab') mode:
import numpy as np
import pickle

arrays = [np.arange(n**2).reshape((n, n)) for n in range(1, 11)]

with open('test.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

new_arrays = []
with open('test.file', 'rb') as f:
    while True:
        try:
            new_arrays.append(pickle.load(f))
        except EOFError:
            break

assert all((new_array == array).all() for new_array, array in zip(new_arrays, arrays))
This might not be the fastest approach, but it should be fast enough. It might seem like this would take up more space, but comparing these:
x = 300
y = 300
arrays = [np.random.randn(x, y) for x in range(30)]

with open('test2.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

with open('test3.file', 'ab') as f:
    for array in arrays:
        f.write(array.tobytes())

with open('test4.file', 'ab') as f:
    for array in arrays:
        np.save(f, array)
You'll find the file sizes come out to 1,025 KB, 1,020 KB, and 1,022 KB respectively, so the per-array overhead of pickle is negligible.
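Worth noting: the np.save variant can be read back incrementally just like the pickle version, since np.load on an open file handle reads one array and leaves the handle at the start of the next. A sketch (older NumPy versions raise ValueError rather than EOFError at end of file, hence catching both):

```python
import numpy as np

arrays = [np.arange(n ** 2).reshape((n, n)) for n in range(1, 11)]

with open('test5.file', 'wb') as f:
    for array in arrays:
        np.save(f, array)

loaded = []
with open('test5.file', 'rb') as f:
    while True:
        try:
            # Reads a single .npy record from the current file position.
            loaded.append(np.load(f))
        except (EOFError, ValueError):
            break

assert len(loaded) == len(arrays)
assert all((a == b).all() for a, b in zip(loaded, arrays))
```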