I am currently working on a rather large dataset that barely fits into memory, so I use np.memmap. At some point, though, I have to split my dataset into training and test sets.
I have found a case where slicing an np.memmap using an index array causes a problem.
(Below you can find the code and memory allocations.)
Line # Mem usage Increment Line Contents
================================================
7 29.340 MB 0.000 MB def my_func2():
8 29.340 MB 0.000 MB ARR_SIZE = (1221508/4,430)
9 29.379 MB 0.039 MB big_mmap = np.memmap('big_mem_test.mmap',shape=ARR_SIZE, dtype=np.float64, mode='r')
10 38.836 MB 9.457 MB idx = range(ARR_SIZE[0])
11 2042.605 MB 2003.770 MB sub = big_mmap[idx,:]
12 3046.766 MB 1004.160 MB sub2 = big_mmap[idx,:]
13 3046.766 MB 0.000 MB return type(sub)
But if I want to take a contiguous slice, I would rather use this code:
Line # Mem usage Increment Line Contents
================================================
15 29.336 MB 0.000 MB def my_func3():
16 29.336 MB 0.000 MB ARR_SIZE = (1221508/4,430)
17 29.375 MB 0.039 MB big_mmap = np.memmap('big_mem_test.mmap',shape=ARR_SIZE, dtype=np.float64, mode='r')
18 29.457 MB 0.082 MB sub = big_mmap[0:1221508/4,:]
19 29.457 MB 0.000 MB sub2 = big_mmap[0:1221508/4,:]
Notice that in the second example, in lines 18 and 19, there is no memory allocation and the whole operation is a lot faster.
In the first example, in line 11 there is an allocation, so the whole big_mmap
matrix is read during slicing. What is more surprising, in line 12 there is another allocation. Doing more such operations, you can easily run out of memory.
When I split my dataset the indexes are rather random and not contiguous, so I cannot use the big_mmap[start:end,:]
notation.
My questions are:
Is there any other method which allows me to slice a memmap without reading the whole data into memory?
Why is the whole matrix read into memory when slicing with an index array (example one)?
Why is the data read and allocated again (first example, line 12)?
The double-allocation you are seeing in your first example isn't due to memmap behaviour; rather, it is due to how __getitem__
is implemented for numpy's ndarray class. When an ndarray is indexed using a list (as in your first example), data are copied from the source array. When it is indexed using a slice object, a view is created into the source array (no data are copied). For example:
In [2]: x = np.arange(16).reshape((4,4))
In [3]: x
Out[3]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [4]: y = x[[0, 2], :]
In [5]: y[:, :] = 100
In [6]: x
Out[6]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
y is a copy of the data from x, so changing y had no effect on x. Now index the array via slicing:
In [7]: z = x[::2, :]
In [8]: z[:, :] = 100
In [9]: x
Out[9]:
array([[100, 100, 100, 100],
[ 4, 5, 6, 7],
[100, 100, 100, 100],
[ 12, 13, 14, 15]])
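The same distinction applies directly to np.memmap: a basic slice stays backed by the file, while list indexing copies the selected rows into memory. A minimal sketch (the file name and shape here are hypothetical, just a tiny throwaway file for illustration):

```python
import numpy as np
import os
import tempfile

# Create a tiny memory-mapped file to demonstrate with.
path = os.path.join(tempfile.mkdtemp(), "demo.mmap")
arr = np.memmap(path, shape=(8, 4), dtype=np.float64, mode="w+")
arr[:] = np.arange(32).reshape(8, 4)
arr.flush()

big = np.memmap(path, shape=(8, 4), dtype=np.float64, mode="r")

view = big[0:4, :]        # basic slicing: a view, still backed by the file
fancy = big[[0, 2], :]    # fancy indexing: a full in-memory copy

print(np.shares_memory(view, big))   # True  -- no data copied
print(np.shares_memory(fancy, big))  # False -- rows were copied out
```

So every fancy-indexing operation on the memmap materializes a fresh copy, which is why line 12 in the first example allocates another gigabyte.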
Regarding your first question, I'm not aware of a method that will allow you to create arbitrary slices of the array without reading the selected data into memory. Two options you might consider (in addition to something like HDF5/PyTables, which you already discussed):
If you are accessing elements of your training & test sets sequentially (rather than operating on them as two entire arrays), you could easily write a small wrapper class whose __getitem__
method uses your index arrays to pull the appropriate sample from the memmap (i.e., training[i] returns big_mmap[training_ids[i]])
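A minimal sketch of such a wrapper (the class name MemmapSubset is made up for illustration):

```python
import numpy as np

class MemmapSubset:
    """Lazily pulls selected rows from a memmap, one access at a time."""

    def __init__(self, mmap_arr, indices):
        self._arr = mmap_arr
        self._indices = np.asarray(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Integer (basic) indexing into the memmap touches only the
        # pages backing the requested row, not the whole array.
        return self._arr[self._indices[i], :]
```

With this, training = MemmapSubset(big_mmap, training_ids) gives you training[i] equivalent to big_mmap[training_ids[i], :] without ever materializing the full training set.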
Split your array into two separate files, which contain exclusively training or test values. Then you could use two separate memmap objects.
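The second option can be done with flat memory use by copying one row at a time from the source memmap into each destination file. A sketch, using hypothetical file names and a toy shape standing in for big_mem_test.mmap:

```python
import numpy as np
import os
import tempfile

def split_memmap(source, ids, path):
    """Copy the rows listed in `ids` from `source` into a new memmap
    file at `path`, one row at a time so memory use stays flat."""
    out = np.memmap(path, shape=(len(ids), source.shape[1]),
                    dtype=source.dtype, mode="w+")
    for row_out, row_in in enumerate(ids):
        out[row_out, :] = source[row_in, :]
    out.flush()
    return out

# Hypothetical toy data standing in for the real dataset:
tmp = tempfile.mkdtemp()
src = np.memmap(os.path.join(tmp, "big.mmap"),
                shape=(10, 4), dtype=np.float64, mode="w+")
src[:] = np.arange(40).reshape(10, 4)
src.flush()

train_ids = [0, 2, 5, 7, 9]   # e.g. a random train/test split
test_ids = [1, 3, 4, 6, 8]

train = split_memmap(src, train_ids, os.path.join(tmp, "train.mmap"))
test = split_memmap(src, test_ids, os.path.join(tmp, "test.mmap"))
```

After the one-time split cost, both halves can be reopened later with mode='r' and sliced contiguously, avoiding fancy indexing entirely.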