 

Correct way to do operations on Memmapped arrays

Tags:

python

numpy

The operation I'm confused about looks like this. I've been doing it on regular NumPy arrays, but now that I'm working with a memmap I want to understand how it all works.

arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100
# This basically computes the percentile rank of each value with respect to its column

This is what I used on a normal numpy array.
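For example, on a small column the inner double argsort gives each value's rank within the column, and the division then scales that rank to a percentage:

>>> import numpy as np
>>> a = np.array([[30.], [10.], [20.], [40.]])
>>> np.argsort(np.argsort(a, axis=0), axis=0)   # rank of each value in its column
array([[2],
       [0],
       [1],
       [3]])
>>> np.argsort(np.argsort(a, axis=0), axis=0) / float(len(a)) * 100   # percentile rank
array([[ 50.],
       [  0.],
       [ 25.],
       [ 75.]])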

Now, considering arr1 is a 20GB memmapped array, I have a few questions:

1:

arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100 

arr2 would be a regular NumPy array, I'd assume? So executing this would be disastrous memory-wise, right?

Now consider that I've created arr2 as a memmapped array of the correct size (filled with zeros).

2:

arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100

vs

arr2[:] = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100

What is the difference?

3:

Would it be more memory efficient to calculate np.argsort separately as a temporary memmapped array, then np.argsort(np.argsort) as another temporary memmapped array, and only then do the final operation? The argsort array of a 20GB array would itself be pretty huge!

I think these questions will help me understand the inner workings of memmapped arrays in Python!

Thanks...

asked Aug 29 '14 by user1265125



1 Answer

I'm going to answer part 2 first, then parts 1 and 3.

First, arr = <something> is simple variable assignment, whereas arr[:] = <something> assigns to the contents of the array. In the code below, after arr[:] = x, arr is still a memmapped array, whereas after arr = x, arr is an ndarray.

>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> type(arr)
<class 'numpy.core.memmap.memmap'>
>>> x = np.ones((1,10000000))
>>> type(x)
<class 'numpy.ndarray'>
>>> arr[:] = x
>>> type(arr)
<class 'numpy.core.memmap.memmap'>
>>> arr = x
>>> type(arr)
<class 'numpy.ndarray'>

np.argsort returns an array of the same type as its argument, so in this specific case I'd think there should be no difference between doing arr = np.argsort(x) and arr[:] = np.argsort(x): in your code, arr2 will be a memmapped array either way. But there is a difference.

>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> x = np.ones((1,10000000))
>>> arr[:] = x
>>> type(np.argsort(x))
<class 'numpy.ndarray'>
>>> type(np.argsort(arr))
<class 'numpy.core.memmap.memmap'>

OK, so what is different? Using arr[:] = np.argsort(arr), if we look at the memmapped file, we see that every change to arr is followed by a change in the file's md5sum.

>>> import os
>>> import numpy as np
>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> arr[:] = np.zeros((1,10000000))
>>> os.system("md5sum mm")
48e9a108a3ec623652e7988af2f88867  mm
0
>>> arr += 1.1
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1  mm
0
>>> arr[:] = np.argsort(arr)
>>> os.system("md5sum mm")
c3607e7de30240f3e0385b59491ac2ce  mm
0
>>> arr += 1.3
>>> os.system("md5sum mm")
1e6af2af114c70790224abe0e0e5f3f0  mm
0

We see that arr still retains its _mmap attribute.

>>> arr._mmap
<mmap.mmap object at 0x7f8e0f086198>

Now using arr = np.argsort(arr), we see that the md5sum stops changing. Even though arr's type is still memmap, it's a new object and it seems the memory mapping is dropped.

>>> import os
>>> import numpy as np
>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> arr[:] = np.zeros((1,10000000))
>>> os.system("md5sum mm")
48e9a108a3ec623652e7988af2f88867  mm
0
>>> arr += 1.1
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1  mm
0
>>> arr = np.argsort(arr)
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1  mm
0
>>> arr += 1.3
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1  mm
0
>>> type(arr)
<class 'numpy.core.memmap.memmap'>

Now the '_mmap' attribute is None.

>>> arr._mmap
>>> type(arr._mmap)
<class 'NoneType'>

Now part 3. It seems pretty easy to lose the reference to the memmapped object when doing complex operations. My current understanding is that you'd have to break things down and use arr[:] = <...> for the intermediate results.
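Something like the following is what I have in mind (just a rough sketch; the shapes and file names here are made up, and note that np.argsort still materializes its result in memory before the copy to disk):

>>> import numpy as np
>>> # assume arr1 already exists on disk as a big memmapped array
>>> arr1 = np.memmap('arr1', dtype='float32', mode='r', shape=(1000000, 20))
>>> # temporary memmaps for the two argsort passes
>>> tmp1 = np.memmap('tmp1', dtype='int64', mode='w+', shape=arr1.shape)
>>> tmp1[:] = np.argsort(arr1, axis=0)
>>> tmp2 = np.memmap('tmp2', dtype='int64', mode='w+', shape=arr1.shape)
>>> tmp2[:] = np.argsort(tmp1, axis=0)
>>> # final result written into a pre-allocated memmap
>>> arr2 = np.memmap('arr2', dtype='float32', mode='w+', shape=arr1.shape)
>>> arr2[:] = tmp2 / float(len(arr1)) * 100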

Using numpy 1.8.1 and Python 3.4.1

answered Sep 16 '22 by A.P.