I saved a couple of numpy arrays with np.save(), and put together they're quite huge.
Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?
Memory-mapped files are used for accessing small segments of large files on disk without reading the entire file into memory. NumPy's memmap objects are array-like; this differs from Python's mmap module, which uses file-like objects.
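Since the arrays were saved with np.save(), there is no need to construct the memmaps by hand: np.load() can open a .npy file as a memmap directly via its mmap_mode argument. A minimal sketch (the file names are placeholders):

import numpy as np

a = np.load('a.npy', mmap_mode='r')  # opened as a read-only memmap, nothing loaded into RAM
b = np.load('b.npy', mmap_mode='r')
print(a.shape, a.dtype)  # shape and dtype come from the .npy header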
Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and read the values from the arrays you wish to concatenate. More efficiently, you can also append new arrays to an already existing file on disk. In any case you must choose the right order for the array (row-major or column-major).
The following examples illustrate how to concatenate along axis 0 and axis 1.
1) concatenate along axis=0
import numpy as np

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222
You can define a third array reading the same file as the first array to be concatenated (here a) in mode 'r+' (open the existing file for reading and writing), but with the shape of the final array you want to achieve after concatenation, like:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b
Concatenating along axis=0 does not require passing order='C' because that is already the default order.
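To be sure everything reached the disk, you can flush the memmap and reopen the file read-only to check the result; continuing the example above:

c.flush()  # force pending writes out to 'a.array'

check = np.memmap('a.array', dtype='float64', mode='r', shape=(20000,1000))
print(check[4999, 0], check[5000, 0])  # 111.0 222.0 -- a's rows followed by b's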
2) concatenate along axis=1
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222
The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up at the end of the first row of c. But you can easily avoid this by passing order='F' (column-major) to memmap. (Note that for non-constant data, a itself would also have to be written with order='F' for the columns to line up; with the uniform values used here it makes no difference.)
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(5000,4000), order='F')
c[:, 3000:] = b
Here you have an updated file 'a.array' with the concatenation result. You may repeat this process to concatenate more arrays two at a time; a generalized sketch of that follows below.
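If you have more than two arrays, the same idea can be wrapped in a small helper. The following is only a sketch, not part of the answer above: the function name is made up, it writes a brand-new output file instead of growing an existing one, and it assumes 2-D arrays with matching column counts:

import numpy as np

def concat_memmaps_axis0(paths, shapes, out_path, dtype='float64'):
    # Concatenate several on-disk arrays along axis 0 by copying each one
    # into a single new memmap; the OS pages data in and out as needed.
    ncols = shapes[0][1]
    nrows = sum(s[0] for s in shapes)
    out = np.memmap(out_path, dtype=dtype, mode='w+', shape=(nrows, ncols))
    row = 0
    for path, shape in zip(paths, shapes):
        src = np.memmap(path, dtype=dtype, mode='r', shape=shape)
        out[row:row + shape[0], :] = src
        row += shape[0]
    out.flush()
    return out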
Maybe an alternative solution: I also had a single multidimensional array spread over multiple files which I only wanted to read, and I solved this with dask concatenation.
import numpy as np
import dask.array as da
a = np.memmap('a.array', dtype='float64', mode='r', shape=(5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))
c = da.concatenate([a, b], axis=0)
This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any NumPy array, and when it comes time to calculate a result one calls compute().
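For example, continuing the snippet above (the slice is arbitrary):

r = c[::2, :10].mean()  # builds a lazy task graph; nothing is read yet
print(r.compute())      # only now are the required blocks read from disk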
Note that there are two caveats:

1) In-place assignment, e.g. c[::2] = 0, is not possible, so creative solutions are necessary in those cases.

2) To write results back to disk without pulling them into memory, the store method should be used. This method can again accept a memmapped array as the target (see the sketch below).
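A minimal sketch of the store route, assuming the memmaps and the dask array c from above (the output file name is a placeholder):

out = np.memmap('result.array', dtype='float64', mode='w+', shape=(20000,1000))
(c * 2).store(out)  # computes block-wise and writes each block into the memmap
out.flush()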