Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Iterate over all pairwise combinations of numpy array columns

I have an numpy array of size

arr.size = (200, 600, 20). 

I want to compute scipy.stats.kendalltau on every pairwise combination of the last two dimensions. For example:

kendalltau(arr[:, 0, 0], arr[:, 1, 0])
kendalltau(arr[:, 0, 0], arr[:, 1, 1])
kendalltau(arr[:, 0, 0], arr[:, 1, 2])
...
kendalltau(arr[:, 0, 0], arr[:, 2, 0])
kendalltau(arr[:, 0, 0], arr[:, 2, 1])
kendalltau(arr[:, 0, 0], arr[:, 2, 2])
...
...
kendalltau(arr[:, 598, 20], arr[:, 599, 20])

such that I cover all combinations of arr[:, i, xi] with arr[:, j, xj] with i < j and xi in [0,20), xj in [0, 20). This is (600 choose 2) * 400 individual calculations, but since each takes about 0.002 s on my machine, it shouldn't take much longer than a day with the multiprocessing module.

What's the best way to go about iterating over these columns (with i<j)? I figure I should avoid something like

for i in range(600):
    for j in range(i+1, 600):
        for xi in range(20):
            for xj in range(20):

What is the most numpythonic way of doing this?

Edit: I changed the title since Kendall Tau isn't really important to the question. I realize I could also do something like

import itertools as it
for i, j in it.combinations(xrange(600), 2):
    for xi, xj in product(xrange(20), xrange(20)):

but there's got to be a better, more vectorized way with numpy.

like image 387
wflynny Avatar asked Aug 09 '13 20:08

wflynny


1 Answers

The general way of vectorizing something like this is to use broadcasting to create the cartesian product of the set with itself. In your case you have an array arr of shape (200, 600, 20), so you would take two views of it:

arr_x = arr[:, :, np.newaxis, np.newaxis, :] # shape (200, 600, 1, 1, 20)
arr_y = arr[np.newaxis, np.newaxis, :, :, :] # shape (1, 1, 200, 600, 20)

The above two lines have been expanded for clarity, but I would normally write the equivalent:

arr_x = arr[:, :, None, None]
arr_y = arr

If you have a vectorized function, f, that did broadcasting on all but the last dimension, you could then do:

out = f(arr[:, :, None, None], arr)

And then out would be an array of shape (200, 600, 200, 600), with out[i, j, k, l] holding the value of f(arr[i, j], arr[k, l]). For instance, if you wanted to compute all the pairwise inner products, you could do:

from numpy.core.umath_tests import inner1d

out = inner1d(arr[:, :, None, None], arr)

Unfortunately scipy.stats.kendalltau is not vectorized like this. According to the docs

"If arrays are not 1-D, they will be flattened to 1-D."

So you cannot go about it like this, and you are going to wind up doing Python nested loops, be it explicitly writing them out, using itertools or disguising it under np.vectorize. That's going to be slow, because of the iteration on Python variables, and because you have a Python function per iteration step, which are both expensive actions.

Do note that, when you can go the vectorized way, there is an obvious drawback: if your function is commutative, i.e. if f(a, b) == f(b, a), then you are doing twice the computations needed. Depending on how expensive your actual computation is, this is very often offset by the increase in speed from not having any Python loops or function calls.

like image 147
Jaime Avatar answered Sep 21 '22 12:09

Jaime