I saved a couple of numpy arrays with np.save(), and put together they're quite huge.
Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?
Memory-mapped files are used for accessing small segments of large files on disk without reading the entire file into memory. NumPy's memmap objects are array-like; this differs from Python's mmap module, which uses file-like objects.
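Since the arrays were saved with np.save(), there is no need to construct the memmaps by hand: np.load() can open a .npy file as a memmap directly via its mmap_mode argument. A minimal sketch (the file names are placeholders):

import numpy as np

a = np.load('a.npy', mmap_mode='r')  # opened as a read-only memmap, nothing loaded into RAM
b = np.load('b.npy', mmap_mode='r')
print(a.shape, a.dtype)  # shape and dtype come from the .npy header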
Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and read the values from the arrays you wish to concatenate. More efficiently, you can also append new arrays to an already existing file on disk. In any case you must choose the right order for the array (row-major or column-major).
The following examples illustrate how to concatenate along axis 0 and axis 1.
1) concatenate along axis=0
import numpy as np

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222
You can define a third array reading the same file as the first array to be concatenated (here a) in mode 'r+' (open the existing file for reading and writing), but with the shape of the final array you want to achieve after concatenation, like:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b
Concatenating along axis=0 does not require passing order='C' because that is already the default order.
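To be sure everything reached the disk, you can flush the memmap and reopen the file read-only to check the result; continuing the example above:

c.flush()  # force pending writes out to 'a.array'

check = np.memmap('a.array', dtype='float64', mode='r', shape=(20000,1000))
print(check[4999, 0], check[5000, 0])  # 111.0 222.0 -- a's rows followed by b's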
2) concatenate along axis=1
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222
The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up at the end of the first row of c. But you can easily avoid this by passing order='F' (column-major) to memmap. (Note that for non-constant data, a itself would also have to be written with order='F' for the columns to line up; with the uniform values used here it makes no difference.)
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(5000,4000), order='F')
c[:, 3000:] = b
Here you have an updated file 'a.array' with the concatenation result. You may repeat this process to concatenate more arrays two at a time; a generalized sketch of that follows below.
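If you have more than two arrays, the same idea can be wrapped in a small helper. The following is only a sketch, not part of the answer above: the function name is made up, it writes a brand-new output file instead of growing an existing one, and it assumes 2-D arrays with matching column counts:

import numpy as np

def concat_memmaps_axis0(paths, shapes, out_path, dtype='float64'):
    # Concatenate several on-disk arrays along axis 0 by copying each one
    # into a single new memmap; the OS pages data in and out as needed.
    ncols = shapes[0][1]
    nrows = sum(s[0] for s in shapes)
    out = np.memmap(out_path, dtype=dtype, mode='w+', shape=(nrows, ncols))
    row = 0
    for path, shape in zip(paths, shapes):
        src = np.memmap(path, dtype=dtype, mode='r', shape=shape)
        out[row:row + shape[0], :] = src
        row += shape[0]
    out.flush()
    return out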
Maybe an alternative solution: I also had a single multidimensional array spread over multiple files which I only wanted to read, and I solved this with dask concatenation.
import numpy as np
import dask.array as da
a = np.memmap('a.array', dtype='float64', mode='r', shape=(5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))
c = da.concatenate([a, b], axis=0)
This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any NumPy array, and when it comes time to calculate a result one calls compute().
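For example, continuing the snippet above (the slice is arbitrary):

r = c[::2, :10].mean()  # builds a lazy task graph; nothing is read yet
print(r.compute())      # only now are the required blocks read from disk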
Note that there are two caveats:

1) In-place assignment, e.g. c[::2] = 0, is not possible, so creative solutions are necessary in those cases.

2) To write results back to disk without pulling them into memory, the store method should be used. This method can again accept a memmapped array as the target (see the sketch below).
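A minimal sketch of the store route, assuming the memmaps and the dask array c from above (the output file name is a placeholder):

out = np.memmap('result.array', dtype='float64', mode='w+', shape=(20000,1000))
(c * 2).store(out)  # computes block-wise and writes each block into the memmap
out.flush()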