I have a NumPy array of shape
arr.shape = (200, 600, 20).
I want to compute scipy.stats.kendalltau
on every pairwise combination of the last two dimensions. For example:
kendalltau(arr[:, 0, 0], arr[:, 1, 0])
kendalltau(arr[:, 0, 0], arr[:, 1, 1])
kendalltau(arr[:, 0, 0], arr[:, 1, 2])
...
kendalltau(arr[:, 0, 0], arr[:, 2, 0])
kendalltau(arr[:, 0, 0], arr[:, 2, 1])
kendalltau(arr[:, 0, 0], arr[:, 2, 2])
...
...
kendalltau(arr[:, 598, 19], arr[:, 599, 19])
such that I cover all combinations of arr[:, i, xi] with arr[:, j, xj] with i < j and xi in [0, 20), xj in [0, 20). This is (600 choose 2) * 400 individual calculations, but since each takes about 0.002 s on my machine, it shouldn't take much longer than a day with the multiprocessing module.
What's the best way to go about iterating over these columns (with i < j)? I figure I should avoid something like

for i in range(600):
    for j in range(i + 1, 600):
        for xi in range(20):
            for xj in range(20):
                kendalltau(arr[:, i, xi], arr[:, j, xj])
What is the most numpythonic way of doing this?
Edit: I changed the title since Kendall Tau isn't really important to the question. I realize I could also do something like

import itertools as it

for i, j in it.combinations(range(600), 2):
    for xi, xj in it.product(range(20), range(20)):
        kendalltau(arr[:, i, xi], arr[:, j, xj])

but there's got to be a better, more vectorized way with numpy.
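For completeness, here is a runnable sketch of that loop on a much smaller array (the shapes and the RNG seed are illustrative, not from the question). Flattening the last two axes into columns reduces the i < j bookkeeping to a single combinations call:

```python
import itertools as it

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
arr = rng.standard_normal((50, 6, 3))   # small stand-in for (200, 600, 20)
n, m, k = arr.shape

# Collapse the last two axes: flat column c corresponds to (i, xi) = divmod(c, k).
cols = arr.reshape(n, m * k)

results = {}
for a, b in it.combinations(range(m * k), 2):
    i, xi = divmod(a, k)
    j, xj = divmod(b, k)
    if i == j:                          # only pairs with i < j are wanted
        continue
    tau, p = kendalltau(cols[:, a], cols[:, b])
    results[(i, xi), (j, xj)] = tau
```

Because a < b implies i <= j for the flattened indices, skipping the i == j case leaves exactly the (m choose 2) * k * k pairs described above.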
The general way of vectorizing something like this is to use broadcasting to create the cartesian product of the set with itself. In your case you have an array arr
of shape (200, 600, 20)
, so you would take two views of it:
arr_x = arr[:, :, np.newaxis, np.newaxis, :] # shape (200, 600, 1, 1, 20)
arr_y = arr[np.newaxis, np.newaxis, :, :, :] # shape (1, 1, 200, 600, 20)
The above two lines have been expanded for clarity, but I would normally write the equivalent:
arr_x = arr[:, :, None, None]
arr_y = arr
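A quick way to sanity-check the resulting broadcast shape, without materializing the very large intermediate array (this assumes np.broadcast_shapes, available since NumPy 1.20):

```python
import numpy as np

arr = np.empty((200, 600, 20))
arr_x = arr[:, :, None, None]   # shape (200, 600, 1, 1, 20)
arr_y = arr                     # aligns from the right as (1, 1, 200, 600, 20)

# Broadcasting pairs every (i, j) with every (k, l) along a shared length-20 axis.
print(np.broadcast_shapes(arr_x.shape, arr_y.shape))   # (200, 600, 200, 600, 20)
```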
If you had a vectorized function f
that broadcast over all but the last dimension, you could then do:
out = f(arr[:, :, None, None], arr)
And then out
would be an array of shape (200, 600, 200, 600)
, with out[i, j, k, l]
holding the value of f(arr[i, j], arr[k, l])
. For instance, if you wanted to compute all the pairwise inner products, you could do:
from numpy.core.umath_tests import inner1d
out = inner1d(arr[:, :, None, None], arr)
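One caveat: numpy.core.umath_tests was never public API, and it is gone from recent NumPy releases. The same pairwise inner products can be written with np.einsum; a sketch on a small random array (shapes are illustrative, so the (4, 5, 4, 5) result stays cheap to build):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((4, 5, 3))        # small stand-in for (200, 600, 20)

# Sum over the shared last axis x; (i, j) and (k, l) become output axes.
out = np.einsum('ijx,klx->ijkl', arr, arr)  # shape (4, 5, 4, 5)

# out[i, j, k, l] is the inner product of arr[i, j] with arr[k, l].
assert np.allclose(out[1, 2, 3, 4], arr[1, 2] @ arr[3, 4])
```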
Unfortunately scipy.stats.kendalltau
is not vectorized like this. According to the docs
"If arrays are not 1-D, they will be flattened to 1-D."
So you cannot go about it like this; you are going to wind up with nested Python loops, whether you write them out explicitly, use itertools
or disguise them under np.vectorize
. That's going to be slow, because the iteration happens in Python and because a Python function is called at every step, both of which are expensive.
Do note that, when you can go the vectorized way, there is an obvious drawback: if your function is commutative, i.e. if f(a, b) == f(b, a)
, then you are doing twice the computations needed. Depending on how expensive your actual computation is, this is very often offset by the increase in speed from not having any Python loops or function calls.
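To illustrate that last point: for a commutative f, the full broadcast result is symmetric under swapping (i, j) with (k, l), so if memory allows you can at least extract each unordered pair once after the fact. A sketch, computing all pairwise inner products with np.einsum on a small random array:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((4, 5, 3))
out = np.einsum('ijx,klx->ijkl', arr, arr)   # every ordered pair, each computed twice

# Collapse (i, j) and (k, l) into flat column indices; the matrix is symmetric.
flat = out.reshape(4 * 5, 4 * 5)
iu = np.triu_indices(4 * 5, k=1)             # strictly upper triangle: unordered pairs
unique_vals = flat[iu]                       # (20 choose 2) == 190 values
```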