
Is it possible to np.concatenate memory-mapped files?

I saved a couple of numpy arrays with np.save(), and put together they're quite huge.

Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?

vedran asked Dec 08 '12 19:12




2 Answers

Using numpy.concatenate apparently loads the arrays into memory. To avoid this, you can create a third memmap array in a new file and copy into it the values from the arrays you wish to concatenate. More efficiently, you can append the new arrays to an already existing file on disk.

In either case you must choose the right memory order for the array (row-major or column-major).
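The "third array in a new file" approach can be sketched as follows. This is a minimal sketch, not the answerer's code: the file names, the tiny shapes and the chunk size are placeholders chosen for illustration. Because the arrays in the question were written with np.save(), the files carry a .npy header, so the sketch assumes np.load(..., mmap_mode='r') for reading and numpy.lib.format.open_memmap for creating the output, rather than a raw np.memmap.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical file names inside a scratch directory.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.npy")
path_b = os.path.join(tmpdir, "b.npy")
path_c = os.path.join(tmpdir, "c.npy")

# Small stand-ins for the large arrays saved with np.save().
np.save(path_a, np.full((50, 10), 111.0))
np.save(path_b, np.full((150, 10), 222.0))

# mmap_mode='r' maps the data lazily and accounts for the .npy header.
sources = [np.load(p, mmap_mode="r") for p in (path_a, path_b)]

# Create the output as a new memory-mapped .npy file with the final shape.
total_rows = sum(s.shape[0] for s in sources)
c = open_memmap(
    path_c,
    mode="w+",
    dtype=sources[0].dtype,
    shape=(total_rows,) + sources[0].shape[1:],
)

# Copy block by block so that only `chunk` rows are handled at a time.
chunk = 64  # arbitrary; tune to the available memory
row = 0
for src in sources:
    for start in range(0, src.shape[0], chunk):
        block = src[start:start + chunk]
        c[row:row + block.shape[0]] = block
        row += block.shape[0]
c.flush()

# The result can be reopened later as a read-only memory map.
result = np.load(path_c, mmap_mode="r")
```

The output file can then be sliced like any array, and only the pages that are actually touched are read from disk.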

The following examples illustrate how to concatenate along axis 0 and axis 1.


1) concatenate along axis=0

import numpy as np

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,1000))  # 38.1 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222

You can define a third array that maps the same file as the first array to be concatenated (here a) in mode r+ (read and write), but with the shape of the final array you want after concatenation. Because the requested shape is larger than the existing file, the file is extended on disk:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b

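Concatenating along axis=0 does not require passing order='C', because this is already the default order. In row-major order each row is contiguous on disk, so the new rows simply land at the end of the file.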
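To check the axis=0 recipe on something small, here is a sketch with tiny arrays in a scratch directory, compared against an ordinary np.concatenate. The shapes and paths are illustrative only, and the sketch assumes a NumPy version in which opening with mode='r+' and a larger shape extends the file.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.array")  # hypothetical paths
path_b = os.path.join(tmpdir, "b.array")

a = np.memmap(path_a, dtype="float64", mode="w+", shape=(5, 3))
a[:] = np.arange(15).reshape(5, 3)
b = np.memmap(path_b, dtype="float64", mode="w+", shape=(2, 3))
b[:] = 100 + np.arange(6).reshape(2, 3)

# In-memory reference result; acceptable here because the arrays are tiny.
expected = np.concatenate([np.asarray(a), np.asarray(b)], axis=0)

a.flush()
del a  # drop the smaller mapping before reopening the file with a larger shape

# Reopen a's file with the final shape and write b's rows after a's rows.
c = np.memmap(path_a, dtype="float64", mode="r+", shape=(7, 3))
c[5:, :] = b
c.flush()

same = np.array_equal(c, expected)
```

After this runs, the file behind path_a holds 7 * 3 * 8 bytes, and the first five rows are a's original data, untouched.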


2) concatenate along axis=1

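# a must be column-major on disk so that new columns can be appended at the end of the file
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000), order='F') # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1 MB
b[:,:] = 222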

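The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up in the first row of c. You can avoid this by passing order='F' (column-major) to memmap, so that each column is contiguous on disk and the new columns are appended at the end of the file. Note that the existing data in 'a.array' is only read back correctly if a was itself written in column-major order; with a constant fill value, as in this example, the difference is not visible.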

c = np.memmap('a.array', dtype='float64', mode='r+',shape=(5000,4000), order='F')
c[:, 3000:] = b

The file 'a.array' now holds the concatenation result. You can repeat this process to concatenate further arrays, one pair at a time.
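The axis=1 case is easier to get wrong, so here is a small sketch with non-constant values that makes the memory order visible. The paths and shapes are illustrative. The key assumption is that the file being extended was written with order='F'; the array being appended (b) can be in any order, because it is copied by index.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.array")  # hypothetical paths
path_b = os.path.join(tmpdir, "b.array")

# The destination file is written in column-major order from the start.
a = np.memmap(path_a, dtype="float64", mode="w+", shape=(4, 3), order="F")
a[:] = np.arange(12).reshape(4, 3)
# b keeps the default row-major order.
b = np.memmap(path_b, dtype="float64", mode="w+", shape=(4, 2))
b[:] = 100 + np.arange(8).reshape(4, 2)

# In-memory reference result; acceptable here because the arrays are tiny.
expected = np.concatenate([np.asarray(a), np.asarray(b)], axis=1)

a.flush()
del a  # drop the smaller mapping before reopening the file with a larger shape

# Reopen a's file with the final shape; the new columns go at the end of the file.
c = np.memmap(path_a, dtype="float64", mode="r+", shape=(4, 5), order="F")
c[:, 3:] = b
c.flush()

same = np.array_equal(c, expected)
```

If a had been created with the default order='C' instead, the first three columns of c would come back scrambled, which a constant fill value would hide.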

Related questions:

  • Working with big data in python and numpy, not enough ram, how to save partial results on disc?
Saullo G. P. Castro answered Nov 14 '22 23:11


As an alternative: I also had a single multidimensional array spread over multiple files, which I only needed to read. I solved this with dask concatenation.

import numpy as np
import dask.array as da
 
a = np.memmap('a.array', dtype='float64', mode='r', shape=( 5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))

c = da.concatenate([a, b], axis=0)

This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any numpy array, and when it comes time to calculate a result one calls compute().

Note that there are two caveats:

  1. In-place re-assignment is not possible (e.g. c[::2] = 0), so creative solutions are necessary in those cases.
  2. This also means the original files can no longer be updated. To save results out, dask's store function should be used. It can again accept a memmapped array as its target.
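Both caveats can be worked around along these lines. This is only a sketch under assumptions: the file names, the small shapes and the chunk size are placeholders, and it assumes dask is installed. Instead of assigning in place, it builds a new lazy array with da.where and writes it to a separate memmap with da.store.

```python
import os
import tempfile

import dask.array as da
import numpy as np

tmpdir = tempfile.mkdtemp()
# Hypothetical file names inside a scratch directory.
path_a, path_b, path_out = (
    os.path.join(tmpdir, name) for name in ("a.array", "b.array", "out.array")
)

a = np.memmap(path_a, dtype="float64", mode="w+", shape=(50, 10))
a[:] = 111
b = np.memmap(path_b, dtype="float64", mode="w+", shape=(150, 10))
b[:] = 222

# Wrap each memmap explicitly; the chunk size is an arbitrary choice.
c = da.concatenate(
    [da.from_array(a, chunks=(25, 10)), da.from_array(b, chunks=(25, 10))],
    axis=0,
)

# Caveat 1: rather than c[::2] = 0, derive a new lazy array.
rows = da.arange(c.shape[0], chunks=25)
zeroed = da.where((rows % 2 == 0)[:, None], 0.0, c)

# Caveat 2: write the result into a separate memmap, chunk by chunk.
out = np.memmap(path_out, dtype="float64", mode="w+", shape=c.shape)
da.store(zeroed, out)
out.flush()

# Slicing stays lazy until compute() is called.
window = c[45:55].compute()
```

The source files are left untouched; only the output file receives the modified values.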
DIN14970 answered Nov 14 '22 23:11