I am trying to perform a large linear-algebra computation to transform a generic covariance matrix KK_l_obs (shape (NL, NL)) into a map of covariance matrices in a reduced space Kmap_PC (shape (q, q, X, Y)).
Information about how to construct Kmap_PC for each spatial location is held in other arrays a_map, I0, and k_l_th. The first two have shape (X, Y), and the third (nl, nl). The transformation between the observed and reduced space is handled by the eigenvectors E (shape (q, nl)). Note that NL > nl.
A spatial element of Kmap_PC is calculated as:
Kmap_PC[..., X, Y] = E.dot(
    KK_l_obs[I0[X, Y] : I0[X, Y] + nl,
             I0[X, Y] : I0[X, Y] + nl] / a_map[X, Y]
    + k_l_th).dot(E.T)
The bit inside the first dot product could theoretically be computed directly using np.einsum, but that would take up hundreds of GB of memory. What I am doing now is looping through the spatial indices of Kmap_PC, which is pretty slow. I could also distribute the calculation using MPI (which could probably give a speedup of 3-4x, since I have 16 cores available).
I'm wondering:
(a) whether I can do the computation more efficiently, perhaps by explicitly breaking it down into groups of spatial elements; and
(b) whether I can reduce the memory overhead of those calculations.
Code snippet
import numpy as np

np.random.seed(1)
X = 10
Y = 10
NL = 3000
nl = 1000
q = 7

a_map = 5. * np.random.rand(X, Y)
E = np.random.randn(q, nl)

# construct constant component
m1_ = .05 * np.random.rand(nl, nl)
k_l_th = m1_.dot(m1_)

# construct variable component
m2_ = np.random.rand(NL, NL)
KK_l_obs = m2_.dot(m2_.T)

# where to start in big cov
I0 = np.random.randint(0, NL - nl, (X, Y))

# the slow way
def looping():
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))
    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] / a_map[si[0], si[1]]
            + k_l_th).dot(E.T)
    return K_PC

def veccalc():
    nl_ = np.arange(nl)[..., None, None]
    I, J = np.meshgrid(nl_, nl_)
    K_s = KK_l_obs[I0[..., None, None] + J, I0[..., None, None] + I]
    K_s = K_s / a_map[..., None, None] + k_l_th[None, None, ...]
    print(K_s.nbytes)
    K_PC = E @ K_s @ E.T
    K_PC = np.moveaxis(K_PC, [0, 1], [-2, -1])
    return K_PC
One very simple performance tweak that's often overlooked in NumPy is replacing division with multiplication. The difference is negligible for scalar-to-scalar divisions or for divisions between equal-shaped arrays. But NumPy's implicit broadcasting makes it interesting for divisions between arrays of different shapes, or between an array and a scalar: in those cases we can get a noticeable boost by multiplying with the reciprocal instead. Thus, for the stated problem, we would pre-compute the reciprocal of a_map and use those reciprocals for multiplication in place of division.
So, at the start, do:
r_a_map = 1.0 / a_map
Then, within the nested loops, use it as:
KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * r_a_map[si[0], si[1]]
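As a quick sanity check of this tweak in isolation (a minimal sketch of mine; the array A and scalar s are made-up stand-ins for a sliced block of KK_l_obs and one element of a_map):

import numpy as np
from timeit import timeit

A = np.random.rand(1000, 1000)   # stand-in for one sliced block
s = 3.7                          # stand-in for one element of a_map
r = 1.0 / s                      # precomputed reciprocal

print(timeit(lambda: A / s, number=100))   # broadcast division
print(timeit(lambda: A * r, number=100))   # multiplication, typically faster
assert np.allclose(A / s, A * r)           # same result up to float rounding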
We could use the distributive property of multiplication over addition there:
A*(B + C) = A*B + A*C
Thus, k_l_th, which is added in every iteration but stays constant, could be taken outside of the loop and accounted for after getting out of the nested loops. Its effective contribution would be: E.dot(k_l_th).dot(E.T). So, we would add this to K_PC.
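To convince yourself of that identity numerically (a small self-contained check of mine; P, B, C are made-up stand-ins for E, the varying sliced block, and k_l_th):

import numpy as np

P = np.random.randn(3, 5)    # stands in for E
B = np.random.randn(5, 5)    # stands in for the varying sliced block
C = np.random.randn(5, 5)    # stands in for the constant k_l_th

# P @ (B + C) @ P.T  ==  P @ B @ P.T  +  P @ C @ P.T
assert np.allclose(P.dot(B + C).dot(P.T),
                   P.dot(B).dot(P.T) + P.dot(C).dot(P.T))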
Using tweak #1 and tweak #2, we end up with a modified approach, like so -
def original_mod_app():
    r_a_map = 1.0 / a_map
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))
    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
            * r_a_map[si[0], si[1]]).dot(E.T)
    return K_PC + E.dot(k_l_th).dot(E.T)[:, :, None, None]
Runtime test with the same sample setup as used in the question (original_app is the question's looping) -
In [458]: %timeit original_app()
1 loops, best of 3: 1.4 s per loop
In [459]: %timeit original_mod_app()
1 loops, best of 3: 677 ms per loop
In [460]: np.allclose(original_app(), original_mod_app())
Out[460]: True
So, we are getting a speedup of 2x+ there.
On a relatively modest machine (4 GB memory) a matmul calc on the whole 10x10x1000x1000 space works.
def looping2(n=2):
    ktemp = np.empty((n, n, nl, nl))
    for i, j in np.ndindex(ktemp.shape[:2]):
        I0_ = I0[i, j]
        temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
        temp = temp / a_map[i, j] + k_l_th
        ktemp[i, j, ...] = temp
    K_PC = E @ ktemp @ E.T
    return K_PC
K = looping()
k4 = looping2(n=X)
np.allclose(k4, K.transpose(2, 3, 0, 1))   # True
I haven't tried to vectorize the I0_ mapping (a sketch of one possibility follows the einsum timings below); my focus is on generalizing the double dot product.
The equivalent einsum is:
K_PC = np.einsum('ij,...jk,lk->il...', E, ktemp, E)
That produces a ValueError: iterator is too large error for n=7.
But with a more recent NumPy version (one that supports the optimize keyword),
K_PC = np.einsum('ij,...jk,lk->il...', E, ktemp, E, optimize='optimal')
does work for the full 7x7x10x10 output.
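To inspect what optimize decides, np.einsum_path reports the chosen contraction order and its estimated cost (a side note of mine, assuming ktemp is the (n, n, nl, nl) stack built as in looping2):

# path is the contraction order; report is a human-readable summary
path, report = np.einsum_path('ij,...jk,lk->il...', E, ktemp, E,
                              optimize='optimal')
print(report)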
Timings aren't promising: 2.2 s for the original looping, 3.9 s for the big matmul (or einsum). (I get the same 2x speedup with original_mod_app.)
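As for vectorizing the I0_ gather itself, one possibility (my own sketch, not benchmarked; it assumes NumPy >= 1.20 for sliding_window_view, and the fancy-indexing step materializes (X, Y, nl, nl) copies of roughly 800 MB each for this setup) is:

from numpy.lib.stride_tricks import sliding_window_view

# view of every (nl, nl) window of KK_l_obs:
# shape (NL-nl+1, NL-nl+1, nl, nl), no copy yet
windows = sliding_window_view(KK_l_obs, (nl, nl))

# indexing with I0 on both axes picks the diagonal block for each (x, y);
# this advanced-indexing step copies, giving an (X, Y, nl, nl) array
K_s = windows[I0, I0] / a_map[..., None, None] + k_l_th

# batched double dot product, then move the spatial axes to the back
K_PC = np.moveaxis(E @ K_s @ E.T, [0, 1], [2, 3])

Given the timings below, those big copies make this unlikely to beat the in-loop reduction, but it does remove the Python-level loop entirely.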
============
time for constructing a (10,10,1000,1000) array (iteratively):
In [31]: %%timeit
    ...: ktemp = np.empty((n,n,nl,nl))
    ...: for i,j in np.ndindex(ktemp.shape[:2]):
    ...:     I0_ = I0[i, j]
    ...:     temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
    ...:     ktemp[i,j,...] = temp
    ...:
1 loop, best of 3: 749 ms per loop
time for reducing that to (10,10,7,7) with @ (longer than the construction)
In [32]: timeit E @ ktemp @ E.T
1 loop, best of 3: 1.17 s per loop
time for the same two operations, but with the reduction in the loop
In [33]: %%timeit
    ...: ktemp = np.empty((n,n,q,q))
    ...: for i,j in np.ndindex(ktemp.shape[:2]):
    ...:     I0_ = I0[i, j]
    ...:     temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
    ...:     ktemp[i,j,...] = E @ temp @ E.T
1 loop, best of 3: 858 ms per loop
Performing the dot product within the loop reduces the size of the subarrays that are saved to ktemp, which more than makes up for the calculation cost. The dot operation on the big array is, by itself, more expensive than your loop. Even if we could 'vectorize' the KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] slicing, it wouldn't make up for the cost of handling that big array.
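If you still want to limit peak memory while keeping some batching, a middle ground (my own sketch under the question's setup; the chunk parameter is a made-up tuning knob) is to stack only a few spatial rows at a time, apply the batched matmul per chunk, and keep the reciprocal and constant-term tweaks from the first answer:

def chunked(chunk=2):
    # constant term pulled out of the loop (distributive property)
    const = E.dot(k_l_th).dot(E.T)[:, :, None, None]
    r_a_map = 1.0 / a_map
    K_PC = np.empty((q, q, X, Y))
    for x0 in range(0, X, chunk):
        x1 = min(x0 + chunk, X)
        # stack the (nl, nl) blocks for this chunk of rows:
        # roughly chunk * Y * nl * nl * 8 bytes at a time
        K_s = np.empty((x1 - x0, Y, nl, nl))
        for i, j in np.ndindex(x1 - x0, Y):
            I0_ = I0[x0 + i, j]
            K_s[i, j] = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
        K_s *= r_a_map[x0:x1, :, None, None]
        # batched double dot product, spatial axes moved to the back
        K_PC[..., x0:x1, :] = np.moveaxis(E @ K_s @ E.T, [0, 1], [2, 3])
    return K_PC + const

chunk trades memory for batch size: chunk=X reproduces the all-at-once matmul above, while chunk=1 behaves much like the per-element loop.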