To make a distance matrix or to repeatedly calculate distance

Question

I'm working on an implementation of the K-medoids algorithm. It is a clustering algorithm, and one of its steps involves finding the most representative point in each cluster.

So, here's the thing:

I have a certain number of clusters
Each cluster contains a certain number of points
I need to find the point in each cluster that results in the least error if it is picked as the cluster representative
The distance from each point to all the others in the cluster needs to be calculated
This distance calculation could be as simple as Euclidean distance or more complex, like DTW (Dynamic Time Warping) between two signals
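To make the step described above concrete, here is a minimal sketch of picking a cluster representative. It assumes the error of a candidate is the sum of its distances to the points in the cluster; the names `euclidean` and `find_medoid` are hypothetical and only for illustration, and any other distance function (DTW, for example) could be passed in instead.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_medoid(cluster, dist=euclidean):
    """Return (medoid, total_error) for a non-empty list of points.

    Assumption: the error of a candidate is the sum of its distances to
    all points in the cluster (its zero distance to itself is included).
    """
    best_point, best_error = None, math.inf
    for candidate in cluster:
        error = sum(dist(candidate, other) for other in cluster)
        if error < best_error:  # ties keep the earlier candidate
            best_point, best_error = candidate, error
    return best_point, best_error

# Five points on a line; the middle of the dense group should win.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
medoid, error = find_medoid(cluster)
print(medoid, error)
```

Note that this calls the distance function for every ordered pair in the cluster, which is exactly the cost the question is about.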

There are two approaches. One is to compute a distance matrix that stores the distances between all pairs of points in the dataset. The other is to compute distances on demand during clustering, which means that the distances between some pairs of points will be computed repeatedly.

On the one hand, building the distance matrix requires computing the distances between all pairs of points in the whole dataset, and some of the computed values will never be used.

On the other hand, if you don't build the distance matrix, you will repeat some calculations over a certain number of iterations.

Which is the better approach?

I'm also considering a MapReduce implementation, so opinions from that angle are also welcome.

Thanks

amit · Accepted Answer

A third approach is to combine both by lazily evaluating the distance matrix. Initialize the matrix with default values that cannot occur as real distances (negative ones, for example). When you need the distance between two points, check whether the value is already present in the matrix and, if it is, just read it. Otherwise, calculate it and store it in the matrix.

This approach trades calculations for more branches in the code and a few more instructions. It is optimal in the sense that it performs the lowest possible number of pair calculations, since no pair is ever computed twice. However, thanks to branch predictors, I assume this overhead will not be that dramatic.
I predict it will perform better when the distance calculation is relatively expensive.
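A minimal sketch of the lazily evaluated matrix could look like the following. The class name `LazyDistanceMatrix` and its `computed` counter are hypothetical, and the sketch assumes the distance is symmetric, non-negative, and zero between a point and itself, so that `-1.0` is safe as the "not yet computed" marker.

```python
import math

class LazyDistanceMatrix:
    """Symmetric distance cache that is filled in on first use."""

    UNSET = -1.0  # sentinel; assumes real distances are never negative

    def __init__(self, points, dist):
        self.points = points
        self.dist = dist
        n = len(points)
        self.matrix = [[self.UNSET] * n for _ in range(n)]
        self.computed = 0  # number of calls made to the distance function

    def get(self, i, j):
        """Distance between points i and j, computed at most once per pair."""
        if i == j:
            return 0.0  # assumption: dist(x, x) == 0
        cached = self.matrix[i][j]
        if cached != self.UNSET:
            return cached
        d = self.dist(self.points[i], self.points[j])
        # Store both orientations, since the distance is assumed symmetric.
        self.matrix[i][j] = self.matrix[j][i] = d
        self.computed += 1
        return d

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
lazy = LazyDistanceMatrix(points, math.dist)
first = lazy.get(0, 1)   # calls the distance function
again = lazy.get(1, 0)   # served from the cache
print(first, again, lazy.computed)
```

The extra cost per lookup is one comparison against the sentinel, which is the branch overhead mentioned above.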

Another optimization could be to switch dynamically to a plain matrix implementation (calculating the remaining part of the matrix) once the number of already calculated distances exceeds a certain threshold. This can be achieved pretty nicely in OOP languages by swapping the implementation behind the interface when the threshold is met.
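One way this switch might be sketched is shown below. The class name `AdaptiveDistanceMatrix`, the default `threshold=0.5` (a fraction of all pairs), and the trick of rebinding `get` to a different method are illustrative assumptions, not something prescribed by the answer; the same assumptions about the distance function apply as for the lazy matrix.

```python
import math

class AdaptiveDistanceMatrix:
    """Starts lazy, then fills the rest and drops the checks past a threshold."""

    UNSET = -1.0  # sentinel; assumes real distances are never negative

    def __init__(self, points, dist, threshold=0.5):
        self.points = points
        self.dist = dist
        n = len(points)
        self.matrix = [[self.UNSET] * n for _ in range(n)]
        for i in range(n):
            self.matrix[i][i] = 0.0  # assumption: dist(x, x) == 0
        self.total_pairs = n * (n - 1) // 2
        self.threshold = threshold  # fraction of pairs that triggers the switch
        self.computed = 0
        self.get = self._lazy_get   # current strategy behind the same interface

    def _compute(self, i, j):
        d = self.dist(self.points[i], self.points[j])
        self.matrix[i][j] = self.matrix[j][i] = d
        self.computed += 1
        return d

    def _lazy_get(self, i, j):
        d = self.matrix[i][j]
        if d == self.UNSET:
            d = self._compute(i, j)
            if self.computed > self.threshold * self.total_pairs:
                self._fill_remaining()
                self.get = self._plain_get  # no more sentinel checks
        return d

    def _plain_get(self, i, j):
        return self.matrix[i][j]

    def _fill_remaining(self):
        n = len(self.points)
        for i in range(n):
            for j in range(i + 1, n):
                if self.matrix[i][j] == self.UNSET:
                    self._compute(i, j)

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
m = AdaptiveDistanceMatrix(points, math.dist)
for i, j in [(0, 1), (0, 2), (0, 3)]:
    m.get(i, j)
before_switch = m.computed  # 3 of 6 pairs, still lazy
m.get(1, 2)                 # 4th pair crosses the 50% threshold
print(before_switch, m.computed, m.get.__name__)
```

The right threshold depends on the workload, so it is best treated as a parameter to tune rather than a fixed value.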

Which implementation is actually better will depend heavily on the cost of the distance function and on the data you are clustering, since some data sets will require the same pairs of points to be compared more often than others.
I suggest running a benchmark and using statistical tools to evaluate which method is actually better.
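As a rough starting point for such a benchmark, the sketch below times two strategies for a single medoid search over repeated runs and reports the mean and standard deviation. The function names, the synthetic 200-point cluster, and the repeat count are all assumptions for illustration; a real comparison should use your own data, your own distance function, and the full clustering loop.

```python
import math
import random
import statistics
import time

def full_matrix_medoid(cluster, dist):
    """Precompute all pairwise distances once, then pick the medoid index."""
    n = len(cluster)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = dist(cluster[i], cluster[j])
    return min(range(n), key=lambda i: sum(m[i]))

def recompute_medoid(cluster, dist):
    """Call the distance function for every ordered pair, with no caching."""
    n = len(cluster)
    return min(range(n),
               key=lambda i: sum(dist(cluster[i], p) for p in cluster))

def benchmark(fn, cluster, dist, repeats=5):
    """Return (mean, stdev) of wall-clock seconds over `repeats` runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(cluster, dist)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

rng = random.Random(0)  # fixed seed so runs use the same synthetic data
cluster = [(rng.random(), rng.random()) for _ in range(200)]
results = {}
for fn in (full_matrix_medoid, recompute_medoid):
    results[fn.__name__] = benchmark(fn, cluster, math.dist)
    mean, stdev = results[fn.__name__]
    print(f"{fn.__name__}: {mean:.4f}s +/- {stdev:.4f}s")
```

With a cheap distance such as Euclidean the difference may be small; swapping in a costly function such as DTW is where the gap would be expected to show.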

Tags:

algorithm

hadoop

mapreduce

Kobe-Wan Kenobi
