 

How does numpy's memmap copy-on-write mode work?

Tags: python, numpy

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expected that it would have to keep all changes in memory, and could therefore run out of memory if you modify every single element. To my surprise, it didn't.

I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each >8 GB). My hope is to use np.memmap to work with these arrays with little memory (<4 GB available).

However, each instance might modify the data differently (e.g., it might normalize the input data differently each time). This has implications for storage: if I use the 'r+' mode, then normalizing the array in my script permanently changes the stored array.

Since I don't want redundant copies of the data, and just want to keep the original data on disk, I thought I should use the 'c' (copy-on-write) mode to open the arrays. But then where do the changes go? Are they kept only in memory? If so, and I change the whole array, won't I run out of memory on a small-memory system?
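
To make the question concrete, here's a small self-contained sketch (using a tiny hypothetical file 'small.npy' rather than my real data) of the behavior I mean: in 'c' mode I can write to the array, but the file on disk never changes.

import numpy as np

np.save('small.npy', np.zeros(5, dtype='float32'))
a = np.load('small.npy', mmap_mode='c')
a[:] = 1.0                    # modify the copy-on-write view; all ones now
print(a)
del a                         # drop the mapping and its private pages
print(np.load('small.npy'))   # still all zeros -- the file was never touched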

Here's an example of a test which I expected to fail:

On a large memory system, create the array:

import numpy as np

GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
nbytes = a.size * a.itemsize  # equivalently, a.nbytes
print('{} GB'.format(nbytes / GB))
print('{} GiB'.format(nbytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB

Now, on a machine with just 2 GB of memory, this fails as expected:

a = np.load('a.npy')

But these two will succeed, as expected:

a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')

Issue 1: Running this code, which modifies the memmapped array row by row, runs out of memory (it fails in both 'r+' and 'c' modes):

for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i, :] = i * np.arange(a.shape[1])

Why does this fail (and especially, why does it fail even in 'r+' mode, where it can write the changes back to disk)? I thought memmap would only load pieces of the array into memory at a time?

Issue 2: When I force numpy to flush the changes every once in a while, the loop finishes successfully in both 'r+' and 'c' modes. But how can 'c' mode do this? I didn't think flush() would do anything in 'c' mode. The changes aren't written to disk, so they must be kept in memory, and yet somehow all the changes, which must total over 3 GB, don't cause out-of-memory errors?

for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i, :] = i * np.arange(a.shape[1])
Asked by Amir, Jan 02 '19

1 Answer

Numpy isn't doing anything clever here; it's just deferring to Python's built-in mmap module, which has an access argument that:

accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot.

On Linux, this works by calling the mmap system call with

MAP_PRIVATE

Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.
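
You can see the same mechanism without numpy at all. Here is a minimal sketch using the stdlib mmap module directly ('demo.bin' is a hypothetical scratch file): writes go to private pages, and the underlying file never changes.

import mmap

# create a one-page file of zero bytes
with open('demo.bin', 'wb') as f:
    f.write(b'\x00' * mmap.PAGESIZE)

# map it copy-on-write and modify the mapping
with open('demo.bin', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    m[0:4] = b'\xff\xff\xff\xff'  # lands in private pages, not the file
    print(m[:4])                  # b'\xff\xff\xff\xff'
    m.close()

# the underlying file is untouched
with open('demo.bin', 'rb') as f:
    print(f.read(4))              # b'\x00\x00\x00\x00'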

Regarding your question

The changes aren't written to disk, so they must be kept in memory, and yet somehow all the changes, which must total over 3 GB, don't cause out-of-memory errors?

The changes likely are written to disk, just not to the file you opened. The dirty copy-on-write pages are backed by swap: when physical memory runs low, the kernel can page them out to swap space instead of keeping them resident in RAM.
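
If you want to watch this happen, here is a rough sketch (assuming Linux and the a.npy file from the question; exact numbers depend on your RAM and swap) that tracks peak resident memory as private pages get dirtied:

import resource
import numpy as np

a = np.load('a.npy', mmap_mode='c')
step = 5000
for i in range(0, a.shape[0], step):
    a[i:i + step, :] = 1.0  # dirty a block of copy-on-write pages
    # ru_maxrss is reported in KiB on Linux
    rss_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print('rows dirtied: {}, peak RSS: {:.0f} MiB'.format(i + step, rss_mib))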

Answered by Eric, Nov 14 '22