 

Writing into a NumPy memmap still loads into RAM

I'm testing NumPy's memmap in an IPython Notebook, with the following code:

Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5000000, 40000))

As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.

Anyways, I'm performing a very long series of indexing operations:

import numpy as np

Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5000000, 40000))
with open("somefile.txt", 'rb') as somefile:
    for i in xrange(5000000):
        # Read a line
        line = somefile.readline()
        # For each token in the line, look up its j value
        # and assign the value 1.0 to Ymap[i, j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i, j] = 1.0

These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage sky-rocketing like crazy?

richizy asked Dec 20 '13

2 Answers

A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when RAM of the mmap is full, data will be paged to the given file instead of to the swap disk/file, and when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy wrt. disk accesses (or eager wrt. RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling over the rest to disk.

So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
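A small sketch of this write-back behavior (the file name 'demo.dat' and the toy shape are just illustrations): writes to a memmap land in the OS page cache first; flush() pushes them to the backing file, and dropping the Python object releases the mapping so the OS can reclaim the cached pages. Re-opening the file read-only confirms the data survived on disk.

```python
import numpy as np

fname = 'demo.dat'  # hypothetical file name, for illustration only

# Writes go through the page cache; the OS keeps them in RAM as long
# as it can, but flush() forces them out to the backing file.
a = np.memmap(fname, dtype='float32', mode='w+', shape=(10,))
a[:] = np.arange(10, dtype='float32')
a.flush()
del a  # dropping the object releases the mapping and its cached pages

# Re-opening read-only shows the data persisted on disk.
b = np.memmap(fname, dtype='float32', mode='r', shape=(10,))
```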

Fred Foo answered Oct 07 '22


As the docs say:

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.

There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.

One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of old ones), at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects. So it's a balancing act.
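The periodic-recreation workaround might look something like the sketch below (a scaled-down stand-in for the 5e6 x 4e4 array; the file name 'Y_small.dat', the chunk size, and the i % cols fill pattern are all illustrative assumptions, not from the original code):

```python
import numpy as np

rows, cols = 1000, 50   # small stand-in for the 5e6 x 4e4 array
chunk = 100             # re-create the memmap every `chunk` rows

# Create the backing file once so later openings can use mode='r+'.
np.memmap('Y_small.dat', dtype='float32', mode='w+',
          shape=(rows, cols)).flush()

for start in range(0, rows, chunk):
    # A fresh memmap object per chunk; discarding the old one gives the
    # OS a chance to write back and reclaim the pages it was caching.
    Ymap = np.memmap('Y_small.dat', dtype='float32', mode='r+',
                     shape=(rows, cols))
    for i in range(start, min(start + chunk, rows)):
        Ymap[i, i % cols] = 1.0   # illustrative sparse fill pattern
    Ymap.flush()
    del Ymap

# Verify the writes landed on disk.
check = np.memmap('Y_small.dat', dtype='float32', mode='r',
                  shape=(rows, cols))
```

With this shape, peak resident memory tracks the pages touched within one chunk rather than the whole array, at the cost of repeated open/flush/close overhead.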

Tim Peters answered Oct 07 '22