For a set of observations:

[a1, a2, a3, a4, a5]

their pairwise distances

d = [[0,   a12, a13, a14, a15],
     [a21, 0,   a23, a24, a25],
     [a31, a32, 0,   a34, a35],
     [a41, a42, a43, 0,   a45],
     [a51, a52, a53, a54, 0]]

are given in a condensed form (the upper triangle of the above, row by row, as computed by scipy.spatial.distance.pdist):

c = [a12, a13, a14, a15, a23, a24, a25, a34, a35, a45]
The question is: given an index into the condensed matrix, is there a function f (preferably in Python) that quickly returns which two observations were used to calculate that distance?
f(c,0)=(1,2)
f(c,5)=(2,4)
f(c,9)=(4,5)
...
I have tried some solutions, but none are worth mentioning :(
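For reference, pdist stores the upper triangle row by row, so the forward mapping from a 0-based pair (i, j) with i < j to the condensed index k is k = n*i - i*(i+1)/2 + (j - i - 1); the function f asked for is the inverse of this. A minimal sketch of the forward direction (the name condensed_index is just illustrative):

```python
# Forward mapping: condensed index for the pair (i, j), 0-based with i < j,
# following scipy.spatial.distance.pdist's row-major upper-triangular order.
def condensed_index(n, i, j):
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# For n = 5 observations, the pair (1, 3) (i.e. a2 and a4 in the 1-based
# notation above) sits at condensed index 5.
```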
To complete the list of answers to this question: a fast, vectorized version of fgregg's answer (as suggested by David Marx) could look like this:
import numpy as np

def vec_row_col(d, i):
    # d: number of observations, i: condensed index (scalar or array-like)
    i = np.array(i)
    b = 1 - 2 * d
    x = np.floor((-b - np.sqrt(b**2 - 8 * i)) / 2).astype(int)
    y = (i + x * (b + x + 2) / 2 + 1).astype(int)
    if i.shape:
        return list(zip(x, y))  # zip is lazy in Python 3, so materialize it
    else:
        return (x, y)
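Note that this returns 0-based indices, one less than the 1-based pairs in the question. A quick self-contained check (the function is restated here, with zip materialized for Python 3, so the snippet runs on its own):

```python
import numpy as np

def vec_row_col(d, i):
    i = np.array(i)
    b = 1 - 2 * d
    x = np.floor((-b - np.sqrt(b**2 - 8 * i)) / 2).astype(int)
    y = (i + x * (b + x + 2) / 2 + 1).astype(int)
    if i.shape:
        return list(zip(x, y))
    else:
        return (x, y)

# The question's f(c, 0) = (1, 2), f(c, 5) = (2, 4), f(c, 9) = (4, 5)
# come out 0-based here:
pairs = vec_row_col(5, [0, 5, 9])  # == [(0, 1), (1, 3), (3, 4)]
```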
I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):
import numpy as np
from scipy.spatial import distance
test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)
%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop
res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop
That's about 37 times faster in this example!
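Worth noting (not in the answers above): np.triu_indices(n, k=1) produces all pairs in exactly pdist's condensed order, so for repeated lookups you can also precompute the whole mapping instead of inverting the formula:

```python
import numpy as np

n = 5
# All (row, col) pairs in pdist's condensed order, 0-based.
rows, cols = np.triu_indices(n, k=1)

# Condensed index k corresponds to the observation pair (rows[k], cols[k]).
print(rows[5], cols[5])  # prints: 1 3
print(rows[9], cols[9])  # prints: 3 4
```

This trades O(n^2) memory for O(1) lookups, which can be the better fit when n is moderate and many indices need translating.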