Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.

I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically less than 1000 pairs). Within the grander scheme of the program i am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:

  1. On any given call, the distance metric can be euclidean OR city-block.
  2. The dimensions of the data are weighted.

As far as i can tell, no solution to this particular problem has been posted. The statstics toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.

Ideally, i would like to avoid using functions from the statistics toolbox (i am not certain the user of the function will have access to those toolboxes).

I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.

function [D] = pairdist1(A, B, wts, distancemetric)

% get some information about the data
    numA = size(A,1);
    numB = size(B,1);

    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else error('Function only accepts "cityblock" and "euclidean" distance')
    end

%   format weights for multiplication
    wts = repmat(wts,[numA,1,numB]);

%   get featural differences between A and B pairs
    A = repmat(A,[1 1 numB]);
    B = repmat(permute(B,[3,2,1]),[numA,1,1]);
    differences = abs(A-B).^r;

%   weigh difference values before combining them
    differences = differences.*wts;
    differences = differences.^(1/r);

%   combine features to get distance
    D = permute(sum(differences,2),[1,3,2]);
end

AND:

function [D] = pairdist2(A, B, wts, distancemetric)

% get some information about the data
    numA = size(A,1);
    numB = size(B,1);

    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else error('Function only accepts "cityblock" and "euclidean" distance')
    end

%   use for-loops to generate differences
    D = zeros(numA,numB);
    for i=1:numA
        for j=1:numB
            differences = abs(A(i,:) - B(j,:)).^(1/r);
            differences = differences.*wts;
            differences = differences.^(1/r);    
            D(i,j) = sum(differences,2);
        end
    end
end

Here are the performance tests:

A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';


tic
D1 = pairdist1(A,B,wts,distancemetric);
toc

tic
D2 = pairdist2(A,B,wts,distancemetric);
toc

Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.

Its clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. But i also know that calls to repmat often slow things down, however. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!

EDIT

@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:

comparison

As the data become larger, the bsxfun version becomes the clear winner!

EDIT #2

I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing euclidean distance [link], so i use that method in the euclidean case, and i took @Divakar's advice for city-block. It is still not as fast as pdist2, but its must faster than either of the approaches i laid out earlier in this post, and easily accepts weightings.

like image 737
Nolan Conaway Avatar asked Oct 20 '15 18:10

Nolan Conaway


People also ask

How does Matlab calculate distance?

Description. d = distance( site1,site2 ) returns the distance in meters between site1 and site2 . d = distance( site1,site2 , path ) returns the distance using a specified path type, either a Euclidean or great circle path.

What does pdist2 do in Matlab?

The generated code of pdist2 uses parfor (MATLAB Coder) to create loops that run in parallel on supported shared-memory multicore platforms in the generated code.

How does Matlab calculate Manhattan distance?

Z = mandist( W , P ) takes an S -by- R weight matrix, W , and an R -by- Q matrix of Q input (column) vectors, P , and returns the S -by- Q matrix of vector distances, Z . mandist is the Manhattan distance weight function. Weight functions apply weights to an input to get weighted inputs.

How do you find the Euclidean distance between two points in Matlab?

dist() can calculate the Euclidean distance of multiple points at once, it can certainly be used to calculate the distance for two points, although it seems to be an over-kill because the equation sqrt((x1-x2)^2+(y1-y2)^2) can do that too.


1 Answers

You can replace repmat by bsxfun. Doing so avoids explicit repetition, therefore it's more memory-efficient, and probably faster:

function D = pairdist1(A, B, wts, distancemetric)

    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end

    differences  = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
    differences = bsxfun(@times, differences, wts).^(1/r);
    D = permute(sum(differences,2),[1,3,2]);

end
like image 177
Luis Mendo Avatar answered Oct 12 '22 20:10

Luis Mendo