This differs from "Write multiple numpy arrays to file" in that I need to be able to stream content rather than write it all at once.

I need to write multiple compressed NumPy arrays in binary to a file. I cannot store all the arrays in memory before writing, so it is more like streaming NumPy arrays to a file.
This currently works fine as text:

f = open("somefile", "w")
while doing_stuff:
    f.writelines(somearray + "\n")  # somearray is a new instance every loop

However, this does not work if I try to write the arrays as binary.
Arrays are created at 30 Hz and grow too big to keep in memory. They also cannot each be stored in their own single-array file, because that would be wasteful and cause a huge mess. So I would like one file per session instead of 10k files per session.
The .npz file format is appropriate for this case: it is a compressed version of the native NumPy file format. The savez_compressed() NumPy function allows multiple NumPy arrays to be saved to a single compressed .npz file.
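As a minimal sketch (the file name 'data.npz' and the keyword names here are just placeholders): savez_compressed takes every array in a single call, so by itself it does not solve the streaming problem, but it shows the target format.

```python
import numpy as np

a = np.arange(10)
b = np.linspace(0.0, 1.0, 5)

# All arrays must already be in memory: they are written in one call,
# with each keyword argument becoming a named array in the archive.
np.savez_compressed('data.npz', first=a, second=b)

loaded = np.load('data.npz')
print(loaded.files)  # the keyword names, e.g. ['first', 'second']
```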
An NPZ file is just a zip archive, so you could save each array to a temporary NPY file, add that NPY file to the zip archive, and then delete the temporary file.
For example,
import os
import zipfile

import numpy as np

# File that will hold all the arrays.
filename = 'foo.npz'

with zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        # `a` is the array to be written to the file in this iteration.
        a = np.random.randint(0, 10, size=20)

        # Name of the temporary file to which `a` is written. The root of this
        # filename is the name that will be assigned to the array in the npz
        # file. I've used 'arr_{}' (e.g. 'arr_0', 'arr_1', ...), similar to
        # how `np.savez` treats positional arguments.
        tmpfilename = "arr_{}.npy".format(i)

        # Save `a` to a npy file.
        np.save(tmpfilename, a)

        # Add the file to the zip archive.
        zf.write(tmpfilename)

        # Delete the npy file.
        os.remove(tmpfilename)
Here's an example where that script is run, and then the data is read back using np.load:
In [1]: !ls
add_array_to_zip.py
In [2]: run add_array_to_zip.py
In [3]: !ls
add_array_to_zip.py foo.npz
In [4]: foo = np.load('foo.npz')
In [5]: foo.files
Out[5]:
['arr_0',
'arr_1',
'arr_2',
'arr_3',
'arr_4',
'arr_5',
'arr_6',
'arr_7',
'arr_8',
'arr_9']
In [6]: foo['arr_0']
Out[6]: array([0, 9, 3, 7, 2, 2, 7, 2, 0, 5, 8, 1, 1, 0, 4, 2, 5, 1, 8, 2])
You'll have to test this on your system to see if it can keep up with your array generation process.
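If the temporary files bother you, a variant of the same idea (a sketch, using the placeholder name 'foo_mem.npz') serializes each array to an in-memory buffer with np.save and hands the bytes straight to ZipFile.writestr:

```python
import io
import zipfile

import numpy as np

with zipfile.ZipFile('foo_mem.npz', mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        a = np.random.randint(0, 10, size=20)

        # Serialize `a` in .npy format to an in-memory buffer.
        buf = io.BytesIO()
        np.save(buf, a)

        # Store the raw .npy bytes under the name 'arr_<i>.npy';
        # only the archive itself ever touches the disk.
        zf.writestr('arr_{}.npy'.format(i), buf.getvalue())
```

The result is loadable with np.load exactly like the temp-file version.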
Another alternative is to use something like HDF5, with either h5py or pytables.
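A sketch with h5py, assuming it is installed (the file name 'session.h5' and the dataset names are arbitrary); each array becomes its own compressed dataset as it arrives, so the arrays never have to coexist in memory:

```python
import h5py
import numpy as np

with h5py.File('session.h5', 'w') as f:
    for i in range(10):
        a = np.random.randint(0, 10, size=20)
        # Each array is written to the file as its own gzip-compressed
        # dataset; nothing accumulates in memory between iterations.
        f.create_dataset('arr_{}'.format(i), data=a, compression='gzip')

with h5py.File('session.h5', 'r') as f:
    print(sorted(f.keys()))
```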
One option might be to use pickle to save the arrays to a file opened in append-binary ('ab') mode:
import numpy as np
import pickle

arrays = [np.arange(n**2).reshape((n, n)) for n in range(1, 11)]

with open('test.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

new_arrays = []
with open('test.file', 'rb') as f:
    while True:
        try:
            new_arrays.append(pickle.load(f))
        except EOFError:
            break

assert all((new_array == array).all() for new_array, array in zip(new_arrays, arrays))
This might not be the fastest approach, but it should be fast enough. It might seem like this would take up more space, but comparing these:
x = 300
y = 300
arrays = [np.random.randn(x, y) for x in range(30)]

with open('test2.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

with open('test3.file', 'ab') as f:
    for array in arrays:
        f.write(array.tobytes())

with open('test4.file', 'ab') as f:
    for array in arrays:
        np.save(f, array)
You'll find the file sizes come out to 1,025 KB, 1,020 KB, and 1,022 KB respectively, so the per-array overhead of pickle is negligible.
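Worth noting: the np.save variant can be read back incrementally just like the pickle version, since np.load on an open file handle reads one array and leaves the handle at the start of the next. A sketch (older NumPy versions raise ValueError rather than EOFError at end of file, hence catching both):

```python
import numpy as np

arrays = [np.arange(n ** 2).reshape((n, n)) for n in range(1, 11)]

with open('test5.file', 'wb') as f:
    for array in arrays:
        np.save(f, array)

loaded = []
with open('test5.file', 'rb') as f:
    while True:
        try:
            # Reads a single .npy record from the current file position.
            loaded.append(np.load(f))
        except (EOFError, ValueError):
            break

assert len(loaded) == len(arrays)
assert all((a == b).all() for a, b in zip(loaded, arrays))
```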