
Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.

I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files, processing their data, and writing the processed data to the memmap file. While creating the memmap file I had no memory issues (I was calling memmap.flush() periodically). Here's how I create the memory mapped file:

import numpy as np

mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in range(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1, :] = auxData
    if i1 % 10 == 0:
        mmapData.flush()  # Do this every 10 iterations or so

However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole RAM as soon as the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:

mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))
aux1 = mmapData[5, 1:int(1e7)]  # slice bounds must be integers, not floats

I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing to memory. What am I missing?

Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?

KartMan asked Oct 05 '14 15:10



1 Answer

Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
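One way to check this is to compare the process's virtual size with its resident (physical) size before and after reading a slice. The following is only a minimal sketch under several assumptions: it reads /proc/self/status, so the memory figures are available on Linux only (the helper returns None elsewhere); the file is a hypothetical 64 MB stand-in for the 50 GB one; and the names memory_kb, read_chunk and demo.dat are made up for illustration. The dtype is passed explicitly because np.memmap defaults to uint8 when none is given.

```python
import os
import tempfile

import numpy as np


def memory_kb():
    """Return (virtual_kB, resident_kB) for this process, or None off Linux."""
    try:
        with open("/proc/self/status") as fh:
            fields = dict(line.split(":", 1) for line in fh if ":" in line)
        return int(fields["VmSize"].split()[0]), int(fields["VmRSS"].split()[0])
    except (OSError, KeyError, ValueError):
        return None


def read_chunk(path, shape, dtype, row, start, stop):
    """Map the file read-only and copy one row slice into ordinary memory."""
    mm = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    chunk = np.array(mm[row, start:stop])  # copies only the requested slice
    del mm  # drop our reference to the mapping
    return chunk


# Hypothetical small stand-in for the real file (64 MB instead of 50 GB).
shape, dtype = (64, 1_000_000), np.uint8
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
writer = np.memmap(path, dtype=dtype, mode="w+", shape=shape)
writer[5, :] = 7
writer.flush()
del writer

before = memory_kb()
chunk = read_chunk(path, shape, dtype, row=5, start=1, stop=10_000)
after = memory_kb()
print("chunk:", chunk.shape, chunk.nbytes, "bytes")
print("(virtual kB, resident kB) before:", before, "after:", after)
```

If the resident figure barely moves while the virtual figure is large, the mapping itself is not what is filling RAM. In top, the same distinction shows up as the VIRT and RES columns.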

Andrew Wagner answered Sep 20 '22 00:09