Numpy: delete column from very big memory mapped Numpy Array

Assume I have a very big numpy memory mapped array:

fp = np.memmap("bigarray.mat", dtype='float32', mode='w+', shape=(5000000,5000))

Now after some manipulation, etc, I want to remove column 10:

fp = np.delete(fp,10,1)

This results in an out-of-memory error, presumably because the array returned by np.delete is a regular in-memory array. What I want is a pure memory-mapped delete operation.

What is the most efficient way to delete a column in fully memory-mapped mode?

asked Oct 19 '22 by robert

1 Answer

Disclaimer: I always mix up rows and columns, so I may slip up somewhere in this answer...

One important problem is that removing a non-contiguous chunk of data is non-trivial. For example, consider a slightly smaller example:

fp = np.memmap("bigarray.mat", dtype='float32', mode='w+', shape=(1000000,10000))

This memmap has 10**10 elements at 4 bytes per element, so the structure is roughly 40 GB. It doesn't fit in my laptop's memory, which makes it a good test case.

The following shifts all subsequent rows up by one, effectively deleting the row at index 10:

for i in range(10, 999999):
    fp[i, :] = fp[i+1, :]   # copy row i+1 over row i, one row at a time

This works (it almost kills my OS, but it works).
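For completeness, here is a sketch of that row deletion done end to end, including flushing and truncating the now-stale last row. The truncation step is my own addition (the file-size arithmetic assumes the float32 array from this example), so treat it as an illustration rather than a recipe:

import numpy as np

rows, cols = 1000000, 10000
fp = np.memmap("bigarray.mat", dtype='float32', mode='r+', shape=(rows, cols))

# Shift every row after index 10 one slot up; only one row
# needs to be resident in memory at a time.
for i in range(10, rows - 1):
    fp[i, :] = fp[i + 1, :]
fp.flush()

# The last row is now a stale duplicate. Optionally shrink the file so a
# future memmap opened with shape=(rows - 1, cols) sees only valid data.
del fp
itemsize = np.dtype('float32').itemsize
with open("bigarray.mat", "r+b") as f:
    f.truncate((rows - 1) * cols * itemsize)

However, the analogous loop over columns breaks everything: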

for i in range(10, 9999):
    fp[:, i] = fp[:, i+1]   # copy column i+1 over column i; touches every row

This is because, in order to shift a single column, you need to touch every row. The layout in the file (and in memory) is row-major by default, which means the elements of one column are scattered across the whole file, so you have to access many distant locations just to update one column.
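To make that concrete, in a row-major ('C') layout the byte offset of element (i, j) is (i * ncols + j) * itemsize, so consecutive elements of one column sit a full row apart on disk:

ncols, itemsize = 10000, 4   # shape and dtype from the example above

def offset(i, j):
    # byte position of element (i, j) in a row-major float32 file
    return (i * ncols + j) * itemsize

offset(0, 10)   # 40: column 10 in the first row
offset(1, 10)   # 40040: the same column one row down, 40 KB away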

My experience trying this is that everything stalls once things stop fitting in memory; I don't know whether it is swapping or some caching layer, but the effective behaviour is that it suddenly stops and makes no visible progress.

Of course you could write an algorithm with a better memory-access pattern, one that does not need the full rows in memory at once, but that is a level of optimization I would not normally expect: it is cumbersome to implement, it will still be slow (lots of random access to disk; if you don't have an SSD you are dead), and it is not a very common scenario.
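For the curious, here is a rough sketch of what such a block-wise approach could look like. The block size and the .copy() are my choices, not anything numpy mandates; the idea is simply to walk the file once, in row order, shifting columns within one block of rows at a time:

import numpy as np

rows, cols = 1000000, 10000
fp = np.memmap("bigarray.mat", dtype='float32', mode='r+', shape=(rows, cols))

col = 10        # column to delete
block = 5000    # rows per pass; tune to available RAM (~200 MB here)

# Process the file front to back in row-major order, so each pass
# reads and writes one contiguous region instead of the whole file.
for start in range(0, rows, block):
    stop = min(start + block, rows)
    fp[start:stop, col:-1] = fp[start:stop, col + 1:].copy()
fp.flush()
# The last column is now stale; treat the logical shape as (rows, cols - 1).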

If you must work with columns, you may want to change the order parameter when building your memmap. Fortran order lays memory out by columns instead of rows, which would make the column-deletion example cheap. However, with that layout, deleting a row becomes the breaking operation.

This order parameter is explained in several places of numpy documentation:

[parameter: order, either 'C' or 'F'] Specify the order of the ndarray memory layout: row-major, C-style or column-major, Fortran-style. This only has an effect if the shape is greater than 1-D. The default order is ‘C’.
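As an illustration, creating the memmap in Fortran order is just a matter of passing order='F' (the filename here is hypothetical; the shift loop is the column version from above, now the cheap direction):

import numpy as np

# Column-major layout: each column is contiguous on disk.
fp = np.memmap("bigarray_f.mat", dtype='float32', mode='w+',
               shape=(1000000, 10000), order='F')

# Shifting columns now touches contiguous chunks of the file...
for i in range(10, 9999):
    fp[:, i] = fp[:, i + 1]
# ...while the row-shifting loop becomes the pathological one.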


However, take into account that if you perform that "deletion" you will be moving a lot of gigabytes. And because it does not fit in memory, you will have to effectively modify the file, which is a huge and very slow operation.

I would say that you may instead want some extra logic to keep a "mask" of deleted columns or something like that, at a higher level than numpy (although numpy may have some view class that encapsulates this; I am not entirely sure). You haven't described your use case, so I can only guess. But you are working with a lot of data, and moving it around is a Bad Idea (TM).
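One way such a mask could look at the application level (this is my own sketch, not a numpy facility; note that fancy indexing with the mask copies the selected data into memory):

import numpy as np

fp = np.memmap("bigarray.mat", dtype='float32', mode='r+',
               shape=(1000000, 10000))

# Keep a boolean mask of live columns instead of moving any data.
live = np.ones(fp.shape[1], dtype=bool)
live[10] = False              # logically delete column 10

row = fp[42, live]            # one logical row, with column 10 skipped
chunk = fp[0:1000][:, live]   # a block of logical rows (copied into RAM)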

answered Nov 03 '22 by MariusSiuram