
Streaming multiple numpy arrays to a file

This differs from "Write multiple numpy arrays to file" in that I need to be able to stream content, rather than writing it all at once.

I need to write multiple compressed numpy arrays in binary to a file. I cannot store all the arrays in memory before writing, so it is more like streaming numpy arrays to a file.

This currently works fine as text:

f = open("some file", "w")

while doing_stuff:
    # somearray is a new instance on every loop iteration
    f.write(str(somearray) + "\n")

However, this does not work if I try to write the arrays as binary.

Arrays are created at 30 Hz and grow too big to keep in memory. They also cannot each be stored in their own single-array file, because that would be wasteful and cause a huge mess.

So I would like only one file per session instead of 10k files per session.

asked Nov 26 '17 at 05:11 by dtracers




2 Answers

An NPZ file is just a zip archive, so you could save each array to a temporary NPY file, add that NPY file to the zip archive, and then delete the temporary file.

For example,

import os
import zipfile
import numpy as np


# File that will hold all the arrays.
filename = 'foo.npz'

with zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        # `a` is the array to be written to the file in this iteration.
        a = np.random.randint(0, 10, size=20)

        # Name for the temporary file to which `a` is written.  The root of this
        # filename is the name that will be assigned to the array in the npz file.
        # I've used 'arr_{}' (e.g. 'arr_0', 'arr_1', ...), similar to how `np.savez`
        # treats positional arguments.
        tmpfilename = "arr_{}.npy".format(i)

        # Save `a` to a npy file.
        np.save(tmpfilename, a)

        # Add the file to the zip archive.
        zf.write(tmpfilename)

        # Delete the npy file.
        os.remove(tmpfilename)

Here's an example where that script is run, and then the data is read back using np.load:

In [1]: !ls
add_array_to_zip.py

In [2]: run add_array_to_zip.py

In [3]: !ls
add_array_to_zip.py foo.npz

In [4]: foo = np.load('foo.npz')

In [5]: foo.files
Out[5]: 
['arr_0',
 'arr_1',
 'arr_2',
 'arr_3',
 'arr_4',
 'arr_5',
 'arr_6',
 'arr_7',
 'arr_8',
 'arr_9']

In [6]: foo['arr_0']
Out[6]: array([0, 9, 3, 7, 2, 2, 7, 2, 0, 5, 8, 1, 1, 0, 4, 2, 5, 1, 8, 2])

You'll have to test this on your system to see if it can keep up with your array generation process.
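A variation on the script above that may be worth timing skips the temporary file: `zipfile.ZipFile.open` with `mode="w"` (Python 3.6+) returns a writable file object, and `np.save` accepts a file object, so each array can be written straight into the archive. This is only a sketch; the helper name `append_array` and the file name `stream.npz` are made up for illustration.

```python
import os
import tempfile
import zipfile

import numpy as np


def append_array(zf, name, array):
    """Write `array` into the open ZipFile `zf` as the member `<name>.npy`."""
    # force_zip64=True permits members larger than 2 GiB.
    with zf.open(name + ".npy", mode="w", force_zip64=True) as member:
        np.save(member, array)


# Hypothetical output location, placed in a scratch directory.
filename = os.path.join(tempfile.mkdtemp(), "stream.npz")

written = []
with zipfile.ZipFile(filename, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(5):
        # Stand-in for the array produced in this iteration.
        a = np.random.randint(0, 10, size=20)
        append_array(zf, "arr_{}".format(i), a)
        written.append(a)

# Read everything back, as with any other npz file.
with np.load(filename) as data:
    loaded = [data["arr_{}".format(i)] for i in range(5)]
```

Only one array is held in memory at a time while writing; `written` is kept here purely so the round trip can be checked.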


Another alternative is to use something like HDF5, with either h5py or pytables.

answered Oct 13 '22 at 03:10 by Warren Weckesser


One option might be to use pickle to save the arrays to a file opened in binary append mode:

import numpy as np
import pickle
arrays = [np.arange(n**2).reshape((n,n)) for n in range(1,11)]
with open('test.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

new_arrays = []        
with open('test.file', 'rb') as f:
    while True:
        try:
            new_arrays.append(pickle.load(f))
        except EOFError:
            break
assert all((new_array == array).all() for new_array, array in zip(new_arrays, arrays))

This might not be the fastest, but it should be fast enough. It might seem like pickling would take up more space, but compare these three approaches:

y = 300
# 30 arrays with shapes (0, 300), (1, 300), ..., (29, 300)
arrays = [np.random.randn(n, y) for n in range(30)]

with open('test2.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

with open('test3.file', 'ab') as f:
    for array in arrays:
        f.write(array.tobytes())

with open('test4.file', 'ab') as f:
    for array in arrays:
        np.save(f, array)

You'll find the file sizes are 1,025 KB (pickle), 1,020 KB (tobytes), and 1,022 KB (np.save) respectively, so pickle adds very little overhead.
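The comparison only writes with `np.save`; for completeness, here is a rough sketch of reading such a file back. It assumes the arrays were appended to a plain (uncompressed) file as in the `test4.file` case, and it calls `np.load` repeatedly until the file position reaches the end of the file. The file name `session.npy` and the array shapes are placeholders.

```python
import os
import tempfile

import numpy as np

# Hypothetical session file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "session.npy")

# Writer side: append one array per iteration; shapes may differ.
frames = [np.random.randn(n + 1, 4) for n in range(5)]
with open(path, "ab") as f:
    for frame in frames:
        np.save(f, frame, allow_pickle=False)

# Reader side: each np.load call consumes exactly one saved array,
# so keep loading until the end of the file is reached.
restored = []
size = os.path.getsize(path)
with open(path, "rb") as f:
    while f.tell() < size:
        restored.append(np.load(f, allow_pickle=False))
```

Unlike `tobytes()`, each `np.save` record carries its own dtype and shape, so the reader does not need to know them in advance.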

answered Oct 13 '22 at 03:10 by Sebastian Mendez