Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria.

Imagine you are managing newspaper delivery persons.

  • You have a set of street addresses, each of which is geocoded.
  • You want to cluster the addresses so that each cluster is assigned to a delivery person.
  • The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
  • Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
  • For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.

So... ideas?

UPDATE

The street network graph, as described in Arachnid's answer, is not available.

like image 205
carrier Avatar asked Feb 18 '09 21:02

carrier


People also ask

Which clustering algorithm is fastest?

If it is well-separated clusters, then k-means is the fastest. If it is overlapping dataset, then efficiency and effectiveness are both important, thus fuzzy clustering methods are recommended solutions.

Which is best clustering algorithm and why?

The DBSCAN is better than other cluster algorithms because it does not require a pre-set number of clusters. It identifies outliers as noise, unlike the Mean-Shift method that forces such points into the cluster in spite of having different characteristics. It finds arbitrarily shaped and sized clusters quite well.


1 Answers

I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.

The algorithm works on a list if (x,y) coords ps that are specified as ints. It takes three other parameters as well:

  1. radius (r): given a point, what is the radius for scanning for nearby points
  2. max addresses (maxA): what are the maximum number of addresses (points) per cluster?
  3. min addresses (minA): minimum addresses per cluster

Set limitA=maxA. Main iteration: Initialize empty list possibleSolutions. Outer iteration: for every point p in ps. Initialize empty list pclusters. A worklist of points wps=copy(ps) is defined. Workpoint wp=p. Inner iteration: while wps is not empty. Remove the point wp in wps. Determine all the points wpsInRadius in wps that are at a distance < r from wp. Sort wpsInRadius ascendingly according to the distance from wp. Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster. Add pcluster to pclusters. Remove points in pcluster from wps. If wps is not empty, wp=wps[0] and continue inner iteration. End inner iteration. A list of clusters pclusters is obtained. Add this to possibleSolutions. End outer iteration.

We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:

  private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)   {     double weight = 0;     for (Cluster cluster : pclusters)     {       int ps = cluster.getPoints().size();       double psAvgPC = ps - avgPC;       weight += psAvgPC * psAvgPC / avgCSize;       weight += cluster.getSurface() / ps;     }     return new WeightedPClusters(pclusters, weight);   } 

The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one with limitA=max(minA,(int)avgPC). End main iteration.

Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order and there is no random involved.

To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.

alt text
(source: paperboyalgorithm at sites.google.com)

Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.

alt text
(source: paperboyalgorithm at sites.google.com)

And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.

alt text
(source: paperboyalgorithm at sites.google.com)

Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.

Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.

alt text
(source: paperboyalgorithm at sites.google.com)

alt text
(source: paperboyalgorithm at sites.google.com)

like image 158
eljenso Avatar answered Sep 30 '22 22:09

eljenso