Clustering Algorithm for Paper Boys

Tags: language-agnostic, algorithm, cluster-analysis

I need help selecting or creating a clustering algorithm according to certain criteria.

Imagine you are managing newspaper delivery persons.

You have a set of street addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
Each cluster should have about the same number of addresses. However, a cluster may have fewer addresses if its addresses are more spread out. (Worded another way: find the minimum number of clusters such that each cluster contains at most a maximum number of addresses, and any two addresses within a cluster are separated by at most a maximum distance.)
For bonus points, when the data set is altered (an address is added or removed) and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (i.e. this rules out simple k-means clustering, which is random in nature). Otherwise the delivery persons will go crazy.
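
To make the restated constraint concrete, here is a minimal sketch of a validity check for a single cluster. The `Point` class, the method name, and the use of straight-line distance on planar coordinates are all assumptions made for illustration; real geocoded addresses would probably need a geographic distance instead.

```java
import java.util.List;

final class ClusterConstraints {
    /** Hypothetical point type: planar coordinates, not real latitude/longitude. */
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    /**
     * Returns true if the cluster holds at most maxAddresses points and
     * no two of its points are farther apart than maxDistance.
     */
    static boolean isValid(List<Point> cluster, int maxAddresses, double maxDistance) {
        if (cluster.size() > maxAddresses) {
            return false;
        }
        // Check every unordered pair once.
        for (int i = 0; i < cluster.size(); i++) {
            for (int j = i + 1; j < cluster.size(); j++) {
                Point a = cluster.get(i);
                Point b = cluster.get(j);
                if (Math.hypot(a.x - b.x, a.y - b.y) > maxDistance) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

The pairwise check is quadratic in the cluster size, which should be acceptable as long as the per-cluster maximum is small.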

So... ideas?

UPDATE

The street network graph, as described in Arachnid's answer, is not available.

205

asked Feb 18 '09 21:02

carrier

1 Answer

I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.

The algorithm works on a list of (x,y) coordinates ps that are specified as ints. It takes three other parameters as well:

1. radius (r): given a point, the radius within which to scan for nearby points
2. max addresses (maxA): the maximum number of addresses (points) per cluster
3. min addresses (minA): the minimum number of addresses per cluster

Set limitA=maxA.
Main iteration:
  Initialize empty list possibleSolutions.
  Outer iteration: for every point p in ps.
    Initialize empty list pclusters.
    A worklist of points wps=copy(ps) is defined.
    Workpoint wp=p.
    Inner iteration: while wps is not empty.
      Remove the point wp in wps.
      Determine all the points wpsInRadius in wps that are at a distance < r from wp.
      Sort wpsInRadius ascendingly according to the distance from wp.
      Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster.
      Add pcluster to pclusters.
      Remove points in pcluster from wps.
      If wps is not empty, wp=wps[0] and continue inner iteration.
    End inner iteration.
    A list of clusters pclusters is obtained. Add this to possibleSolutions.
  End outer iteration.
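
As an illustration only, one run of the inner iteration for a single starting point might look like the sketch below. It is not the original code: the `Point` class and the names `GreedyPass` and `clustersFrom` are made up, and it assumes that the workpoint itself belongs to the cluster it seeds and counts toward limitA (the description above leaves this open).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GreedyPass {
    /** Hypothetical point type with int coordinates, compared by identity. */
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static double dist(Point a, Point b) {
        return Math.hypot(a.x - b.x, a.y - b.y);
    }

    /** Builds one list of clusters (pclusters), starting the scan at `start`. */
    static List<List<Point>> clustersFrom(Point start, List<Point> ps, double r, int limitA) {
        List<List<Point>> pclusters = new ArrayList<>();
        List<Point> wps = new ArrayList<>(ps);          // worklist copy of all points
        Point wp = start;
        while (wp != null) {
            wps.remove(wp);
            final Point center = wp;
            // Collect the remaining points closer than r to the workpoint.
            List<Point> inRadius = new ArrayList<>();
            for (Point q : wps) {
                if (dist(center, q) < r) {
                    inRadius.add(q);
                }
            }
            // List.sort is stable, so ties keep their input order.
            inRadius.sort(Comparator.comparingDouble((Point q) -> dist(center, q)));
            // Assumption: the workpoint is part of its own cluster and counts toward limitA.
            int keep = Math.max(0, Math.min(limitA - 1, inRadius.size()));
            List<Point> pcluster = new ArrayList<>();
            pcluster.add(center);
            pcluster.addAll(inRadius.subList(0, keep));
            pclusters.add(pcluster);
            wps.removeAll(pcluster);
            wp = wps.isEmpty() ? null : wps.get(0);     // next workpoint, if any
        }
        return pclusters;
    }
}
```

Calling `clustersFrom` once for every point in ps would yield the candidate lists that go into possibleSolutions.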

We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:

  private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
  {
    double weight = 0;
    for (Cluster cluster : pclusters)
    {
      int ps = cluster.getPoints().size();
      // Penalize cluster sizes that deviate from the global average size.
      double psAvgPC = ps - avgPC;
      weight += psAvgPC * psAvgPC / avgCSize;
      // Penalize spread-out clusters: surface per point.
      weight += cluster.getSurface() / ps;
    }
    return new WeightedPClusters(pclusters, weight);
  }
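
The computation of the two averages is not shown above. A possible sketch, following the definitions given in the text, is below. It represents each cluster as a plain list of points rather than the `Cluster` class used in `weigh`; that representation and the names `SolutionAverages`, `avgPointsPerCluster` and `avgClustersPerSolution` are assumptions.

```java
import java.util.List;

final class SolutionAverages {
    /** avgPC: average number of points per cluster, over every cluster of every candidate. */
    static double avgPointsPerCluster(List<? extends List<? extends List<?>>> possibleSolutions) {
        long points = 0;
        long clusters = 0;
        for (List<? extends List<?>> pclusters : possibleSolutions) {
            for (List<?> cluster : pclusters) {
                points += cluster.size();
                clusters++;
            }
        }
        return clusters == 0 ? 0.0 : (double) points / clusters;
    }

    /** avgCSize: average number of clusters per candidate solution (pclusters). */
    static double avgClustersPerSolution(List<? extends List<? extends List<?>>> possibleSolutions) {
        long clusters = 0;
        for (List<? extends List<?>> pclusters : possibleSolutions) {
            clusters += pclusters.size();
        }
        return possibleSolutions.isEmpty() ? 0.0 : (double) clusters / possibleSolutions.size();
    }
}
```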

The best solution is now the pclusters with the least weight. We then set limitA=max(minA,(int)avgPC) and repeat the main iteration for as long as it yields a better solution (less weight) than the previous best one. End main iteration.
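
The repeat-until-no-improvement logic can be sketched separately from the clustering itself. In the sketch below, `Pass` is a hypothetical holder for the result of one main iteration (its best weight and the global avgPC), and `runPass` stands in for the main iteration run with a given limitA. It assumes the avgPC used for the next limitA is the one from the best pass so far.

```java
import java.util.function.IntFunction;

final class MainIteration {
    /** Hypothetical result of one main iteration. */
    static final class Pass {
        final double bestWeight;
        final double avgPC;
        Pass(double bestWeight, double avgPC) {
            this.bestWeight = bestWeight;
            this.avgPC = avgPC;
        }
    }

    /** Repeats the main iteration while each new pass strictly lowers the best weight. */
    static Pass refine(IntFunction<Pass> runPass, int minA, int maxA) {
        Pass best = runPass.apply(maxA);                 // first pass uses limitA = maxA
        while (true) {
            int limitA = Math.max(minA, (int) best.avgPC);
            Pass candidate = runPass.apply(limitA);
            if (candidate.bestWeight >= best.bestWeight) {
                return best;                             // no improvement: stop
            }
            best = candidate;
        }
    }
}
```

Because `runPass` is deterministic, a pass that repeats the previous limitA returns the same weight and the loop stops.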

Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order, and no randomness is involved.

To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.

[Image: https://i.stack.imgur.com/L2ymx.jpg] (source: paperboyalgorithm at sites.google.com)

Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.

[Image: https://i.stack.imgur.com/i0sgf.jpg] (source: paperboyalgorithm at sites.google.com)

And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7: we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.

[Image: https://i.stack.imgur.com/lsezq.jpg] (source: paperboyalgorithm at sites.google.com)

Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.

Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.

[Images: https://i.stack.imgur.com/ies85.jpg, https://i.stack.imgur.com/bZney.jpg] (source: paperboyalgorithm at sites.google.com)

158

answered Sep 30 '22 22:09

eljenso
