I'm testing NumPy's memmap through IPython Notebook, with the following code:
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5000000, 40000))
As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.
Anyways, I'm performing a very long series of indexing operations:
import numpy as np

Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5000000, 40000))

with open("somefile.txt", 'rb') as somefile:
    for i in xrange(5000000):
        # Read a line
        line = somefile.readline()
        # For each token in the line, look up its j value
        # and assign the value 1.0 to Ymap[i, j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i, j] = 1.0
These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage skyrocketing like crazy?
A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when the RAM backing the mmap is full, data will be paged out to the given file instead of to the swap disk/file, and that when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy with respect to disk accesses (or an eager one with respect to RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling the rest over to disk.

So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
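In NumPy terms, memmap.flush() plays the role of msync here, and dropping the last reference to the array is what eventually unmaps it. A minimal sketch (the file name and shape below are just placeholders):

import numpy as np

# Writing touches pages; the OS keeps them cached in RAM for as long as it can.
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(1000, 1000))
Ymap[0, :] = 1.0

# flush() is the msync step: dirty pages get written out to 'Y.dat'.
Ymap.flush()

# Dropping the last reference unmaps the region (the munmap step).
del Ymap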
As the docs say:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.
One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of old ones), at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects. So it's a balancing act.
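Applied to your loop, that workaround might look something like the sketch below. The chunk size is just an illustrative knob to tune; after the initial mode='w+' creation, the file is reopened with mode='r+' so the data written so far is kept:

import numpy as np

rows, cols = 5000000, 40000
chunk = 100000  # rows to process before recreating the mmap; tune to your RAM budget

# Create the file once, then drop the mapping straight away.
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(rows, cols))
del Ymap

with open("somefile.txt", 'rb') as somefile:
    Ymap = np.memmap('Y.dat', dtype='float32', mode='r+', shape=(rows, cols))
    for i in xrange(rows):
        line = somefile.readline()
        for token in line.split():
            Ymap[i, some_dictionary[token]] = 1.0
        if (i + 1) % chunk == 0:
            # Write dirty pages to disk, throw the old mapping away, and remap.
            # RAM use then stays roughly proportional to the pages touched per chunk.
            Ymap.flush()
            del Ymap
            Ymap = np.memmap('Y.dat', dtype='float32', mode='r+', shape=(rows, cols))
    Ymap.flush()
    del Ymap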