Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Map Clustering Algorithm

Tags:

My current code is pretty quick, but I need to make it even faster so we can accommodate even more markers. Any suggestions?

Notes:

  • The code runs fastest when the SQL statement is ordered by marker name - which itself does a very partial job of clustering the markers (the names of markers in the same location are often, but not always similar).
  • I can't pre-cluster the markers, because they can be dynamically searched and filtered.
  • I've tried grid-based clustering - but the results often aren't very nice.
  • I know that the clusters are slightly skewed on a Mercator projection.
  • I'm not interested in a commercial clustering service.

The code:

$singleMarkers = array();
$clusterMarkers = array();

while (count($markers)) {
    $marker  = array_pop($markers);
    $cluster = array();

    // Compare marker against all remaining markers.
    foreach ($markers as $key => $compareMarker) {
        // This function returns the distance between two markers, at a defined zoom level.
        $pixels = pixelDistance($marker['lat'], $marker['lng'], $compareMarker['lat'], $compareMarker['lng'], $zoomLevel);
        // If two markers are closer than defined distance, remove compareMarker from array and add to cluster.
        if ($pixels < $distance) {
            unset($markers[$key]);
            $cluster[] = $compareMarker;
        }
    }

    // If a marker was added to cluster, also add the marker we were comparing to.
    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}

function pixelDistance($lat1, $lon1, $lat2, $lon2, $zoom) {
    $x1 = $lon1*10000000; //This is what I did to compensate for using lat/lon values instead of pixels.
    $y1 = $lat1*10000000;
    $x2 = $lon2*10000000;
    $y2 = $lat2*10000000;

    return sqrt(pow(($x1-$x2),2) + pow(($y1-$y2),2)) >> (21 - $zoom); //21 is the max zoom level
}

UPDATE

Here's the current code:

$singleMarkers = array();
$clusterMarkers = array();

// Minimum distance between markers to be included in a cluster, at diff. zoom levels
$DISTANCE = (10000000 >> $ZOOM) / 100000;

// Loop until all markers have been compared.
while (count($markers)) {
    $marker  = array_pop($markers);
    $cluster = array();

    // Compare against all markers which are left.
    foreach ($markers as $key => $target) {
        $pixels = abs($marker['lat']-$target['lat']) + abs($marker['lng']-$target['lng']);

        // If the two markers are closer than given distance remove target marker from array and add it to cluster.
        if ($pixels < $DISTANCE) {
            unset($markers[$key]);
            $cluster[] = $target;
        }
    }

    // If a marker has been added to cluster, add also the one we were comparing to.
    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}
like image 539
Chris B Avatar asked Sep 16 '09 16:09

Chris B


People also ask

What is map clustering?

A cluster or map combines the two stages of brainstorming (recording ideas and then grouping them) into one. It also allows you to see, at a glance, the aspects of the subject about which you have the most to say, so it can help you choose how to focus a broad subject for writing.

How is clustering represented on a map?

Each cluster is represented by a circle with a diameter that is directly proportional to the number of markers it represents which, in turn, is also numerically represented. Clicking on a cluster will show you the individual markers or sub-clusters.

What are the algorithms used for clustering?

k-means is the most widely-used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.

What is a cluster map in statistics?

Cluster mapping creates a dataset on the presence of clusters across geographies, based on a standardized set of benchmark cluster definitions that group individual industries uniquely into cluster categories.


2 Answers

Do you actually need to calculate the Euclidean distance? If you are just comparing relative magnitudes of distances, you can probably get away with using the Manhattan distance, which will save you two calls to pow() and one to sqrt():

function pixelDistance($lat1, $lon1, $lat2, $lon2, $zoom) {
    $x1 = $lon1*10000000; //This is what I did to compensate for using lat/lon values instead of pixels.
    $y1 = $lat1*10000000;
    $x2 = $lon2*10000000;
    $y2 = $lat2*10000000;

    return ($x1-$x2) + ($y1-$y2) >> (21 - $zoom);
}

Not sure if you need the >> (21 - $zoom) bit for your calculations, so I left it in. But unless you actually need to use the calculated distance values elsewhere, you can probably get away with just using the latitude/longitude directly (no need to multiply by anything) and taking the Manhattan distance, assuming you pre-calculate $distance to fit in with that measure, which will be a lot cheaper in computational terms than coercing all the distances to fit in with the units and magnitude of $distance.

EDIT: When I was researching this problem last year, I found some useful stuff on Wikipedia - yes, it can happen ;-)

  • Cluster analysis
  • Hierarchical clustering

I can also highly recommend the book Programming Collective Intelligence: Building Smart Web 2.0 Applications which goes into clustering in great depth, as applied not only to geographical data but also to other areas of data analysis.

like image 115
NickFitz Avatar answered Oct 29 '22 20:10

NickFitz


Expanding on what John said, I think you should try inlining that function. Function calls in PHP are very slow, so you should get a decent speedup from that.

like image 26
ryeguy Avatar answered Oct 29 '22 19:10

ryeguy