Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast weighted euclidean distance between points in arrays

I need to efficiently calculate the euclidean weighted distances for every x,y point in a given array to every other x,y point in another array. This is the code I have which works as expected:

import numpy as np
import random

def rand_data(integ):
    '''
    Function that generates 'integ' random values between [0.,1.)
    '''
    rand_dat = [random.random() for _ in range(integ)]

    return rand_dat

def weighted_dist(indx, x_coo, y_coo):
    '''
    Function that calculates *weighted* euclidean distances.
    '''
    dist_point_list = []
    # Iterate through every point in array_2.
    for indx2, x_coo2 in enumerate(array_2[0]):
        y_coo2 = array_2[1][indx2]
        # Weighted distance in x.
        x_dist_weight = (x_coo-x_coo2)/w_data[0][indx] 
        # Weighted distance in y.
        y_dist_weight = (y_coo-y_coo2)/w_data[1][indx] 
        # Weighted distance between point from array_1 passed and this point
        # from array_2.
        dist = np.sqrt(x_dist_weight**2 + y_dist_weight**2)
        # Append weighted distance value to list.
        dist_point_list.append(round(dist, 8))

    return dist_point_list


# Generate random x,y data points.
array_1 = np.array([rand_data(10), rand_data(10)], dtype=float)

# Generate weights for each x,y coord for points in array_1.
w_data = np.array([rand_data(10), rand_data(10)], dtype=float)

# Generate second larger array.
array_2 = np.array([rand_data(100), rand_data(100)], dtype=float)


# Obtain *weighted* distances for every point in array_1 to every point in array_2.
dist = []
# Iterate through every point in array_1.
for indx, x_coo in enumerate(array_1[0]):
    y_coo = array_1[1][indx]
    # Call function to get weighted distances for this point to every point in
    # array_2.
    dist.append(weighted_dist(indx, x_coo, y_coo))

The final list dist holds as many sub-lists as points are in the first array with as many elements in each as points are in the second one (the weighted distances).

I'd like to know if there's a way to make this code more efficient, perhaps using the cdist function, because this process becomes quite expensive when the arrays have lots of elements (which in my case they have) and when I have to check the distances for lots of arrays (which I also have)

like image 409
Gabriel Avatar asked Jan 12 '23 04:01

Gabriel


1 Answers

@Evan and @Martinis Group are on the right track - to expand on Evan's answer, here's a function that uses broadcasting to quickly calculate the n-dimensional weighted euclidean distance without Python loops:

import numpy as np

def fast_wdist(A, B, W):
    """
    Compute the weighted euclidean distance between two arrays of points:

    D{i,j} = 
    sqrt( ((A{0,i}-B{0,j})/W{0,i})^2 + ... + ((A{k,i}-B{k,j})/W{k,i})^2 )

    inputs:
        A is an (k, m) array of coordinates
        B is an (k, n) array of coordinates
        W is an (k, m) array of weights

    returns:
        D is an (m, n) array of weighted euclidean distances
    """

    # compute the differences and apply the weights in one go using
    # broadcasting jujitsu. the result is (n, k, m)
    wdiff = (A[np.newaxis,...] - B[np.newaxis,...].T) / W[np.newaxis,...]

    # square and sum over the second axis, take the sqrt and transpose. the
    # result is an (m, n) array of weighted euclidean distances
    D = np.sqrt((wdiff*wdiff).sum(1)).T

    return D

To check that this works OK, we'll compare it to a slower version that uses nested Python loops:

def slow_wdist(A, B, W):

    k,m = A.shape
    _,n = B.shape
    D = np.zeros((m, n))

    for ii in xrange(m):
        for jj in xrange(n):
            wdiff = (A[:,ii] - B[:,jj]) / W[:,ii]
            D[ii,jj] = np.sqrt((wdiff**2).sum())
    return D

First, let's make sure that the two functions give the same answer:

# make some random points and weights
def setup(k=2, m=100, n=300):
    return np.random.randn(k,m), np.random.randn(k,n),np.random.randn(k,m)

a, b, w = setup()
d0 = slow_wdist(a, b, w)
d1 = fast_wdist(a, b, w)

print np.allclose(d0, d1)
# True

Needless to say, the version that uses broadcasting rather than Python loops is several orders of magnitude faster:

%%timeit a, b, w = setup()
slow_wdist(a, b, w)
# 1 loops, best of 3: 647 ms per loop

%%timeit a, b, w = setup()
fast_wdist(a, b, w)
# 1000 loops, best of 3: 620 us per loop
like image 141
ali_m Avatar answered Jan 23 '23 01:01

ali_m