Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient Numpy computation of pairwise squared differences

The following code does exactly what I want, which is to compute the pairwise sum of squares of differences between elements of a vector (length three in the example), of which I have a long series (limited to five here). The desired result is shown at the bottom. But the implementation feels kludgy for two reasons:

1) the need to add a phantom dimension, changing the shape from (5, 3) to (5,1,3) to avoid broadcast problems, and

2) the apparent necessity of an explicit 'for' loop, which I'm sure is why it's taking hours to execute on my much larger data set (a million vectors of length 2904).

Is there a more efficient and/or pythonic way to achieve the same result?

a = np.array([[ 4,  2,  3], [-1, -5,  4], [ 2,  1,  4], [-5, -1,  4], [6, -3,  3]])
a = a.reshape((5,1,3))

m = a.shape[0]
n = a.shape[2]
d = np.zeros((n,n))
for i in range(m):
    c = a[i,:] - np.transpose(a[i,:])
    c = c**2
    d += c

print d

[[   0.  118.  120.]
 [ 118.    0.  152.]
 [ 120.  152.    0.]]
like image 375
Grant Petty Avatar asked Sep 05 '15 16:09

Grant Petty


2 Answers

If you don't mind the dependency on scipy, you can use functions from the scipy.spatial.distance library:

In [17]: from scipy.spatial.distance import pdist, squareform

In [18]: a = np.array([[ 4,  2,  3], [-1, -5,  4], [ 2,  1,  4], [-5, -1,  4], [6, -3,  3]])

In [19]: d = pdist(a.T, metric='sqeuclidean')

In [20]: d
Out[20]: array([ 118.,  120.,  152.])

In [21]: squareform(d)
Out[21]: 
array([[   0.,  118.,  120.],
       [ 118.,    0.,  152.],
       [ 120.,  152.,    0.]])
like image 196
Warren Weckesser Avatar answered Nov 13 '22 03:11

Warren Weckesser


You could eliminate the for-loop by using:

In [48]: ((a - a.swapaxes(1,2))**2).sum(axis=0)
Out[48]: 
array([[  0, 118, 120],
       [118,   0, 152],
       [120, 152,   0]])

Note that if a has shape (N, 1, M) then (a - a.swapaxes(1,2)) has shape (N, M, M). Make sure you have enough RAM to accommodate an array of this size. Page swapping can also slow the calculation to a crawl.

If you do have too little memory, you will have to break up the calculation in chunks:

m, _, n = a.shape
chunksize = 10**4
d = np.zeros((n,n))
for i in range(0, m, chunksize):
    b = a[i:i+chunksize]
    d += ((b - b.swapaxes(1,2))**2).sum(axis=0)

This is a compromise between performing the calculation on the entire array and calculating row-by-row. If there are a million rows, and the chunksize is 10**4, then there will be only 100 iterations of the loop instead of a million. Thus, it should be significantly faster than calculating row-by-row. Choose the largest value of chunksize you can which allows the calculation to be performed in RAM.

like image 41
unutbu Avatar answered Nov 13 '22 02:11

unutbu