Understanding performance of numpy memmap

I'm trying to better understand how numpy's memmap handles views of very large files. The script below opens a memory-mapped 2048^3 array and copies a downsampled 128^3 view of it:

import numpy as np
from time import time

FILE = '/Volumes/BlackBox/test.dat'
array = np.memmap(FILE, mode='r', shape=(2048,2048,2048), dtype=np.float64)

t = time()
for i in range(5):
    view = np.array(array[::16, ::16, ::16])
t = ((time() - t) / 5) * 1000
print("Time (ms): %i" % t)

Usually, this prints Time (ms): 80 or so. However, if I change the view assignment to

view = np.array(array[1::16, 2::16, 3::16])

and run it three times, I get the following:

Time (ms): 9988
Time (ms): 79
Time (ms): 78

Does anybody understand why the first invocation is so much slower?

ChrisB asked Aug 06 '12 16:08


1 Answer

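After the first run, the OS still has portions (or all) of the mapped file cached in physical RAM, so later runs are served from memory. The initial read of any region has to access the disk, which is a lot slower than accessing RAM. Changing the slice offsets most likely touches parts of the file that have not been cached yet, which is why the first invocation is slow again. Do enough other disk I/O and you'll find the timing drifts back toward the slow figure, because the OS has to re-read from disk whatever it no longer has cached.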
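To make the caching argument concrete, here is a rough sketch that works out which pages of the file each slicing pattern reads. It never opens the file; it only does the offset arithmetic. The 4 KiB page size, the `touched_pages` helper and the assumption that the data starts at byte 0 of a C-ordered file are mine, not something stated in the question, and OS read-ahead may pull in more than these pages.

```python
import numpy as np

PAGE_SIZE = 4096            # assumption: 4 KiB pages (compare with mmap.PAGESIZE)
SHAPE = (2048, 2048, 2048)  # shape from the question
ITEMSIZE = 8                # bytes per np.float64


def touched_pages(start, step=16):
    """Hypothetical helper: page numbers read by array[s0::step, s1::step, s2::step]."""
    i = np.arange(start[0], SHAPE[0], step, dtype=np.int64)
    j = np.arange(start[1], SHAPE[1], step, dtype=np.int64)
    k = np.arange(start[2], SHAPE[2], step, dtype=np.int64)
    # Byte offset of element (i, j, k) in a C-ordered file with no header.
    row_start = (i[:, None] * SHAPE[1] + j[None, :]) * SHAPE[2] * ITEMSIZE
    offsets = row_start[:, :, None] + (k * ITEMSIZE)[None, None, :]
    return np.unique(offsets // PAGE_SIZE)


pages_a = touched_pages((0, 0, 0))  # array[::16, ::16, ::16]
pages_b = touched_pages((1, 2, 3))  # array[1::16, 2::16, 3::16]
shared = np.intersect1d(pages_a, pages_b)

print("pages touched by first pattern: ", pages_a.size)
print("pages touched by second pattern:", pages_b.size)
print("pages shared by both patterns:  ", shared.size)
print("MiB paged in per pattern:       ", pages_a.size * PAGE_SIZE // 2**20)
```

Under these assumptions the two patterns share no pages at all: each row along the last axis spans four pages, and the second pattern selects different rows than the first. So the first run of the new slice cannot benefit from anything the earlier runs left in the cache, even though the copied result is only 16 MiB.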

Jon Clements answered Sep 19 '22 15:09