Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NumPy Broadcasting: Calculating sum of squared differences between two arrays

I have the following code. It is taking forever in Python. There must be a way to translate this calculation into a broadcast...

def euclidean_square(a,b):
    squares = np.zeros((a.shape[0],b.shape[0]))
    for i in range(squares.shape[0]):
        for j in range(squares.shape[1]):
            diff = a[i,:] - b[j,:]
            sqr = diff**2.0
            squares[i,j] = np.sum(sqr)
    return squares
like image 832
Chris Avatar asked Mar 26 '16 22:03

Chris


People also ask

How do you find the difference between two arrays in NumPy?

subtract() function is used when we want to compute the difference of two array.It returns the difference of arr1 and arr2, element-wise. Parameters : arr1 : [array_like or scalar]1st Input array.

Can Python broadcast multiple dimensions?

The arrays can be broadcast together iff they are compatible with all dimensions. After broadcasting, each array behaves as if it had shape equal to the element-wise maximum of shapes of the two input arrays.

What is broadcasting in NumPy explain with example what are different rules in broadcasting?

The term broadcasting refers to the ability of NumPy to treat arrays of different shapes during arithmetic operations. Arithmetic operations on arrays are usually done on corresponding elements. If two arrays are of exactly the same shape, then these operations are smoothly performed.

What is broadcasting in NumPy?

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.


2 Answers

You can use np.einsum after calculating the differences in a broadcasted way, like so -

ab = a[:,None,:] - b
out = np.einsum('ijk,ijk->ij',ab,ab)

Or use scipy's cdist with its optional metric argument set as 'sqeuclidean' to give us the squared euclidean distances as needed for our problem, like so -

from scipy.spatial.distance import cdist
out = cdist(a,b,'sqeuclidean')
like image 145
Divakar Avatar answered Sep 23 '22 23:09

Divakar


I collected the different methods proposed here, and in two other questions, and measured the speed of the different methods:

import numpy as np
import scipy.spatial
import sklearn.metrics

def dist_direct(x, y):
    d = np.expand_dims(x, -2) - y
    return np.sum(np.square(d), axis=-1)

def dist_einsum(x, y):
    d = np.expand_dims(x, -2) - y
    return np.einsum('ijk,ijk->ij', d, d)

def dist_scipy(x, y):
    return scipy.spatial.distance.cdist(x, y, "sqeuclidean")

def dist_sklearn(x, y):
    return sklearn.metrics.pairwise.pairwise_distances(x, y, "sqeuclidean")

def dist_layers(x, y):
    res = np.zeros((x.shape[0], y.shape[0]))
    for i in range(x.shape[1]):
        res += np.subtract.outer(x[:, i], y[:, i])**2
    return res

# inspired by the excellent https://github.com/droyed/eucl_dist
def dist_ext1(x, y):
    nx, p = x.shape
    x_ext = np.empty((nx, 3*p))
    x_ext[:, :p] = 1
    x_ext[:, p:2*p] = x
    x_ext[:, 2*p:] = np.square(x)

    ny = y.shape[0]
    y_ext = np.empty((3*p, ny))
    y_ext[:p] = np.square(y).T
    y_ext[p:2*p] = -2*y.T
    y_ext[2*p:] = 1

    return x_ext.dot(y_ext)

# https://stackoverflow.com/a/47877630/648741
def dist_ext2(x, y):
    return np.einsum('ij,ij->i', x, x)[:,None] + np.einsum('ij,ij->i', y, y) - 2 * x.dot(y.T)

I use timeit to compare the speed of the different methods. For the comparison, I use vectors of length 10, with 100 vectors in the first group, and 1000 vectors in the second group.

import timeit

p = 10
x = np.random.standard_normal((100, p))
y = np.random.standard_normal((1000, p))

for method in dir():
    if not method.startswith("dist_"):
        continue
    t = timeit.timeit(f"{method}(x, y)", number=1000, globals=globals())
    print(f"{method:12} {t:5.2f}ms")

On my laptop, the results are as follows:

dist_direct   5.07ms
dist_einsum   3.43ms
dist_ext1     0.20ms  <-- fastest
dist_ext2     0.35ms
dist_layers   2.82ms
dist_scipy    0.60ms
dist_sklearn  0.67ms

While the two methods dist_ext1 and dist_ext2, both based on the idea of writing (x-y)**2 as x**2 - 2*x*y + y**2, are very fast, there is a downside: When the distance between x and y is very small, due to cancellation error the numerical result can sometimes be (very slightly) negative.

like image 38
jochen Avatar answered Sep 21 '22 23:09

jochen