Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MemoryError while creating cartesian product in Numpy

I have 3 numpy arrays and need to form the cartesian product between them. Dimensions of the arrays are not fixed, so they can take different values, one example could be A=(10000, 50), B=(40, 50), C=(10000,50).

Then, I perform some processing (like a+b-c) Below is the function that I am using for the product.

def cartesian_2d(arrays, out=None):

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.shape[0] for x in arrays])
    if out is None:
        out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)

    m = n // arrays[0].shape[0]
    out[:, 0] = np.repeat(arrays[0], m, axis=0)
    if arrays[1:]:
        cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
        for j in range(1, arrays[0].shape[0]):
            out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
    return out

a = [[ 0, -0.02], [1, -0.15]]
b = [[0, 0.03]]

result = cartesian_2d([a,b,a])

// array([[[ 0.  , -0.02],
    [ 0.  ,  0.03],
    [ 0.  , -0.02]],

   [[ 0.  , -0.02],
    [ 0.  ,  0.03],
    [ 1.  , -0.15]],

   [[ 1.  , -0.15],
    [ 0.  ,  0.03],
    [ 0.  , -0.02]],

   [[ 1.  , -0.15],
    [ 0.  ,  0.03],  
    [ 1.  , -0.15]]])

The output is the same as with itertools.product. However, I am using my custom function to take advantage of numpy vectorized operations, which is working fine compared to itertools.product in my case.

After this, I do

result[:, 0, :] + result[:, 1, :] - result[:, 2, :]

//array([[ 0.  ,  0.03],
       [-1.  ,  0.16],
       [ 1.  , -0.1 ],
       [ 0.  ,  0.03]])

So this is the final expected result.

The function works as expected as long as my array fits in memory. But my usecase requires me to work with huge data and I get a MemoryError at the line np.empty() since it is unable to allocate the memory required. I am working with circa 20GB data at the moment and this might increase in future.

These arrays represent vectors and will have to be stored in float, so I cannot use int. Also, they are dense arrays, so using sparse is not an option.

I will be using these arrays for further processing and ideally I would not like to store them in files at this stage. So memmap / h5py format may not help, although I am not sure of this.

If there are other ways to form this product, that would be okay too.

As I am sure there are applications with way larger datasets than this, I hope someone has encountered such issues before and would like to know how to handle this issue. Please help.

like image 412
swathis Avatar asked Feb 17 '18 14:02

swathis


2 Answers

If at least your result fits in memory

The following produces your expected result without relying on an intermediate three times the size of the result. It uses broadcasting.

Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:

#shared dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
#array([[ 0.  ,  0.03],
#       [-1.  ,  0.16],
#       [ 1.  , -0.1 ],
#       [ 0.  ,  0.03]])

Addressing result rows by 'ID'

You may consider leaving out the reshape. That would allow you to address the rows in the result by combined index. If your component ID's are just 0,1,2,... like in your example this would be the same as the combined ID. For example aba[1,0,0] would correspond to the row obtained as second row of a + first row of b - first row of a.

A bit of explanation

Broadcasting: When for example adding two arrays their shapes do not have to be identical, only compatible because of broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:

    [[2],                 [[7],   [[2],
7 +  [3],     equiv to     [7], +  [3],
     [4]]                  [7]]    [4]]

Broadcasting:

              [[4],            [[1, 2, 3],   [[4, 4, 4],
[[1, 2, 3]] +  [5],  equiv to   [1, 2, 3], +  [5, 5, 5],
               [6]]             [1, 2, 3]]    [6, 6, 6]]

For this to work each dimension of each operand must be either 1 or equal to the corresponding dimension in each other operand unless it is 1. If an operand has fewer dimensions than the others its shape is padded with ones on the left. Note that the equiv arrays shown in the illustration are not explicitly created.

If the result also does not fit

In that case I don't see how you can possibly avoid using storage, so h5py or something like that it is.

Removing the first column from each operand

This is just a matter of slicing:

a_no_id = a[:, 1:]

etc. Note that, unlike Python lists, NumPy arrays when sliced do not return a copy but a view. Therefore efficiency (memory or runtime) is not an issue here.

like image 161
Paul Panzer Avatar answered Nov 14 '22 17:11

Paul Panzer


An alternate solution is to create a cartesian product of indices (which is easier, as solutions for cartesian products of 1D arrays exist):

idx = cartesian_product(
    np.arange(len(a)),
    np.arange(len(b)) + len(a),
    np.arange(len(a))
)

And then use fancy indexing to create the output array:

x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)
like image 22
Nils Werner Avatar answered Nov 14 '22 18:11

Nils Werner