The operation I'm confused about looks like this. I've been doing it on regular NumPy arrays, but I'd like to understand how it behaves on a memmap.
arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100
# This calculates the percentile rank of each value with respect to its entire column
This is what I used on a normal NumPy array.
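To make the double argsort concrete, here is a minimal sketch on a toy column. The values are made up for illustration; only the expression itself comes from the line above.

```python
import numpy as np

# Toy column vector (made-up values) standing in for arr1.
arr1 = np.array([[30.0], [10.0], [20.0], [40.0]])

# First argsort: for each column, the indices that would sort it.
order = np.argsort(arr1, axis=0)

# Second argsort: the 0-based rank of each element within its column.
ranks = np.argsort(order, axis=0)

# Scale the ranks by the number of rows, as in the one-liner above.
pct = ranks / float(len(arr1)) * 100
print(pct.ravel())  # [50.  0. 25. 75.]
```

Note that the smallest value gets 0 and the largest gets 75 here rather than 100, because the ranks are 0-based and divided by the number of rows.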
Now, considering that arr1 is a 20GB memmapped array, I have a few questions:
1:
arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100
arr2 would be a regular NumPy array, I'd assume? So executing this would be disastrous memory-wise, right?
Now suppose I've created arr2 as a memmapped array of the correct size (filled with zeros).
2:
arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100
vs
arr2[:] = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100
What is the difference?
3:
Would it be more memory efficient to calculate np.argsort separately as a temporary memmapped array, then np.argsort(np.argsort) as another temporary memmapped array, and only then do the final operation? The argsort array of a 20GB array would itself be pretty huge!
I think these questions will help me understand the inner workings of memmapped arrays in Python!
Thanks...
Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory. A common solution is to use memory mapping and implement out-of-core computations. The array is stored in a file on the hard drive, and we create a memory-mapped object to this file that can be used as a regular NumPy array.
I'm going to try to answer part 2 first, then parts 1 and 3.
First, arr = <something> is simple variable assignment: it rebinds the name arr to a new object. arr[:] = <something>, by contrast, assigns to the contents of the existing array. In the code below, after arr[:] = x, arr is still a memmapped array, whereas after arr = x, arr is a plain ndarray.
>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> type(arr)
<class 'numpy.core.memmap.memmap'>
>>> x = np.ones((1,10000000))
>>> type(x)
<class 'numpy.ndarray'>
>>> arr[:] = x
>>> type(arr)
<class 'numpy.core.memmap.memmap'>
>>> arr = x
>>> type(arr)
<class 'numpy.ndarray'>
np.argsort returns an array of the same type as its argument. So in this specific case, you might think there should be no difference between doing arr = np.argsort(x) and arr[:] = np.argsort(x). In your code, arr2 will be a memmapped array. But there is a difference in what happens to the underlying file.
>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> x = np.ones((1,10000000))
>>> arr[:] = x
>>> type(np.argsort(x))
<class 'numpy.ndarray'>
>>> type(np.argsort(arr))
<class 'numpy.core.memmap.memmap'>
OK, now for what is different. Using arr[:] = np.argsort(arr), if we look at the memmapped file, we see that every change to arr is followed by a change in the file's md5sum.
>>> import os
>>> import numpy as np
>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> arr[:] = np.zeros((1,10000000))
>>> os.system("md5sum mm")
48e9a108a3ec623652e7988af2f88867 mm
0
>>> arr += 1.1
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
0
>>> arr[:] = np.argsort(arr)
>>> os.system("md5sum mm")
c3607e7de30240f3e0385b59491ac2ce mm
0
>>> arr += 1.3
>>> os.system("md5sum mm")
1e6af2af114c70790224abe0e0e5f3f0 mm
0
We see that arr still retains its _mmap attribute.
>>> arr._mmap
<mmap.mmap object at 0x7f8e0f086198>
Now using arr = np.argsort(arr), we see that the md5sums stop changing. Even though arr's type is still memmap, it's a new object, and it seems the memory mapping is dropped.
>>> import os
>>> import numpy as np
>>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
>>> arr[:] = np.zeros((1,10000000))
>>> os.system("md5sum mm")
48e9a108a3ec623652e7988af2f88867 mm
0
>>> arr += 1.1
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
0
>>> arr = np.argsort(arr)
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
0
>>> arr += 1.3
>>> os.system("md5sum mm")
b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
0
>>> type(arr)
<class 'numpy.core.memmap.memmap'>
Now the _mmap attribute is None.
>>> arr._mmap
>>> type(arr._mmap)
<class 'NoneType'>
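The md5sum comparisons above can also be reproduced without shelling out. This is a rough sketch, not the exact sessions shown: the `file_md5` helper, the scratch file path and the small shape are assumptions made for illustration, and `flush()` is called before hashing so the file on disk is up to date. To sidestep version-specific behaviour, the rebinding step uses a plain in-memory array rather than the result of `np.argsort`.

```python
import hashlib
import os
import tempfile

import numpy as np


def file_md5(path):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "mm_demo")

arr = np.memmap(path, dtype="float32", mode="w+", shape=(1, 1000))
arr.flush()
before = file_md5(path)

# In-place assignment writes through the mapping into the file.
arr[:] = np.ones((1, 1000))
arr.flush()
after_inplace = file_md5(path)

# Rebinding the name: keep a handle on the mapping so it can still be flushed.
mm = arr
arr = np.zeros((1, 1000))  # `arr` now names a separate in-memory array
arr += 5                   # modifies only the new array
mm.flush()
after_rebind = file_md5(path)

print(before != after_inplace)        # True: the file changed
print(after_inplace == after_rebind)  # True: the file did not change again
```

The first digest changes because `arr[:] = ...` writes into the mapped file. The second stays the same because after rebinding, `arr` no longer refers to the mapping at all.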
Now part 3. It seems pretty easy to lose the reference to the memmapped object when doing complex operations. My current understanding is that you'd have to break things down and use arr[:] = <something> for intermediate results.
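One way to "break things down" could look like the sketch below. It is only an illustration under stated assumptions, not a tested recipe for a 20GB array: the file names, the toy shape and the random data are made up, and it assumes a single column fits comfortably in RAM. Each column is ranked in memory and then written into the output memmap with slice assignment, so arr2 never stops being backed by its file.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch directory and toy shape; a real array would be far larger.
tmpdir = tempfile.mkdtemp()
n_rows, n_cols = 1000, 8

# Input memmap filled with made-up data.
arr1 = np.memmap(os.path.join(tmpdir, "arr1.dat"), dtype="float64",
                 mode="w+", shape=(n_rows, n_cols))
arr1[:] = np.random.RandomState(0).rand(n_rows, n_cols)

# Output memmap of the same shape (starts out zero-filled).
arr2 = np.memmap(os.path.join(tmpdir, "arr2.dat"), dtype="float64",
                 mode="w+", shape=(n_rows, n_cols))

for j in range(n_cols):
    # Copy a single column into RAM; only this column's temporaries exist at once.
    col = np.array(arr1[:, j])
    ranks = np.argsort(np.argsort(col))
    # Slice assignment writes into the mapped file instead of rebinding arr2.
    arr2[:, j] = ranks / float(n_rows) * 100

arr2.flush()
```

With the default row-major layout, reading one column touches pages spread across the whole file, so for a very large array it may be worth processing a block of columns per iteration instead of one.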
Using NumPy 1.8.1 and Python 3.4.1.