Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Condensed matrix function to find pairs

For a set of observations:

[a1,a2,a3,a4,a5]

their pairwise distances

d=[[0,a12,a13,a14,a15]
   [a21,0,a23,a24,a25]
   [a31,a32,0,a34,a35]
   [a41,a42,a43,0,a45]
   [a51,a52,a53,a54,0]]

Are given in a condensed matrix form (upper triangular of the above, calculated from scipy.spatial.distance.pdist ):

c=[a12,a13,a14,a15,a23,a24,a25,a34,a35,a45]

The question is, given that I have the index in the condensed matrix is there a function (in python preferably) f to quickly give which two observations were used to calculate them?

f(c,0)=(1,2)
f(c,5)=(2,4)
f(c,9)=(4,5)
...

I have tried some solutions but none worth mentioning :(

like image 275
Ηλίας Avatar asked Mar 16 '11 10:03

Ηλίας


1 Answers

To complete the list of answers to this question: A fast, vectorized version of fgreggs answer (as suggested by David Marx) could look like this:

def vec_row_col(d,i):                                                                
    i = np.array(i)                                                                 
    b = 1 - 2 * d                                                                   
    x = np.floor((-b - np.sqrt(b**2 - 8*i))/2).astype(int)                                      
    y = (i + x*(b + x + 2)/2 + 1).astype(int)                                                    
    if i.shape:                                                                     
        return zip(x,y)                                                             
    else:                                                                           
        return (x,y) 

I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):

import numpy as np
from scipy.spatial import distance

test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)

%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop

res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop

That's about 37 times faster in this example!

like image 77
chris-sc Avatar answered Oct 02 '22 12:10

chris-sc