I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.
I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files and processing their data and then writing the processed data to the memmap file. During the creation of the memmap file, I had no memory issues (I was using memmap.flush() periodically). Here's how I create the memory mapped file:
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
for i1 in np.arange(numFiles):
auxData = load_data_from(file[i1])
mmapData[i1,:] = auxData
mmapData.flush() % Do this every 10 iterations or so
However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole ram when the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:
mmapData = np.memmap(mmapFile, mode='r',shape=(large_no1,large_no2))
aux1 = mmapData[5,1:1e7]
I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing to memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored in disk?
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy's memmap's are array-like objects. This differs from Python's mmap module, which uses file-like objects.
A NumPy array can be specified to be stored in row-major format, using the keyword argument order='C' , and the column-major format, using the keyword argument order='F' , when the array is created or reshaped. The default format is row-major.
Memory-mapped files are accessed through the operating system's memory manager, so the file is automatically partitioned into a number of pages and accessed as needed. You do not have to handle the memory management yourself.
NumPy arrays are faster and more compact than Python lists. An array consumes less memory and is convenient to use. NumPy uses much less memory to store data and it provides a mechanism of specifying the data types. This allows the code to be optimized even further.
Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With