
How to cluster points and plot

I am trying to use clustering in R. I am a rookie and haven't worked much with R.

I have geolocation points as latitude and longitude values. What I want to do is find the hotspots in this data.

I am looking to create clusters of 4 or more points that lie within 600 feet of each other.

I want to get the centroids of such clusters and plot them.

The data looks like this:

LATITUDE    LONGITUD
32.70132    -85.52518
34.74251    -86.88351
32.55205    -87.34777
32.64144    -85.35430
34.92803    -87.81506
32.38016    -86.29790
32.42127    -87.08690
...

structure(list(LATITUDE = c(32.70132, 34.74251, 32.55205, 32.64144, 
34.92803, 32.38016, 32.42127, 32.9095, 33.58092, 32.51617, 33.5726, 
33.83251, 34.65639, 34.27694, 33.73851, 33.95132, 31.35445, 34.05263, 
33.37959, 30.50248, 32.31561, 32.66919, 31.75039, 33.56986, 33.27091, 
33.93598, 32.30964, 31.09773, 32.26711, 33.54263, 34.72014, 34.78548, 
30.65705, 31.25939, 31.27647, 30.54322, 31.22416, 33.38549, 33.18338, 
31.16811, 32.38368, 32.36253, 31.14464), LONGITUD = c(-85.52518, 
-86.88351, -87.34777, -85.3543, -87.81506, -86.2979, -87.0869, 
-85.75888, -86.27647, -86.21179, -86.65275, -87.2696, -85.72738, 
-87.71489, -86.48934, -86.29693, -88.22943, -87.55328, -85.31454, 
-87.79342, -86.88108, -86.26669, -88.04425, -86.44631, -87.74383, 
-87.72403, -86.28067, -85.4449, -87.62541, -86.56251, -86.48971, 
-85.59656, -88.24491, -86.60828, -86.18112, -88.22778, -85.63784, 
-86.03297, -87.55456, -85.37719, -86.38047, -86.21579, -86.86606
)), .Names = c("LATITUDE", "LONGITUD"), class = "data.frame", row.names = c(NA, 
-43L))

There are 30,800 entries (geolocations) in the full data frame; the above is only a sample.

I cannot use k-means because it needs the number of clusters to be specified up front, and I do not know that number here. Each cluster should consist of 4 or more points that are within about 600 ft of each other.

As an initial step, I tried to plot all the latitude and longitude points to get an idea of what the data looks like, so that I can later check whether the plot of the clusters resembles this plot.

plot(dbfvar[,1], dbfvar[,2], type="l") #dbfvar is the dataframe having above data.

The plot was not satisfactory; it was not what I expected.

The main part is to create the clusters and obtain the centroids of them, and visualize the centroids of the clusters formed.

P.S.: I am not confined to R; I can use Python as well. I am looking for a good solution to the above problem before I go ahead and apply it to 7 such files (each with 30,800 geolocations).

user3543477 asked Oct 24 '14 03:10



1 Answer

Hierarchical clustering is one approach.

First you construct a dendrogram:

dend <- hclust(dist(theData), method="complete")

I am using "complete" linkage here, so that groups are merged by the maximum-distance rule: the distance between two groups is the largest distance between any pair of their members. This is useful later, because it ensures that all of the points in one group are at most a certain distance apart.
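To illustrate the point about complete linkage, here is a minimal sketch on made-up random points (the names `xy`, `grp` and `diameters` are illustrative only and do not come from the question's data). It cuts a complete-linkage tree at height 2 and checks that no cluster contains a pair of points farther apart than 2:

```r
# Sketch: with complete linkage, cutting at height h bounds every cluster's diameter by h.
set.seed(1)
xy <- matrix(runif(200, 0, 10), ncol = 2)   # 100 random 2-D points (made-up data)
d <- dist(xy)
grp <- cutree(hclust(d, method = "complete"), h = 2)

dm <- as.matrix(d)
# largest pairwise distance inside each cluster (0 for single-point clusters)
diameters <- sapply(split(seq_along(grp), grp), function(i) max(dm[i, i]))
all(diameters <= 2)
```

With other linkage methods, such as "single" or "average", this bound is not guaranteed.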

I chose a cut height of 2 because I am not sure how to convert your latitudes and longitudes to feet; you should convert first and then use 600 instead of 2. Here is the resulting dendrogram, cut at a height of 2:

plot(dend, hang=-1)
abline(h=2, col="red", lty=2)  # horizontal cut line at height 2

[Image: dendrogram with the red cut line at height 2]

Now each subtree intersected by the red line will become one cluster.

groups <- cutree(dend, h=2) # change "h" here to 600 after converting to feet.
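For the conversion to feet, one possible approach is to build the distance matrix from great-circle distances instead of calling `dist()` on raw degrees. The sketch below is an illustration under stated assumptions, not part of the original answer: it uses the haversine formula on a spherical Earth with a mean radius of 6371008.8 m, and `haversine_ft` and `dist_ft` are hypothetical helper names.

```r
# Sketch: pairwise great-circle distances in feet (haversine formula).
# Assumes a spherical Earth; helper names are hypothetical, not from any package.
haversine_ft <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  r_feet <- 6371008.8 / 0.3048            # mean Earth radius, metres -> feet
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r_feet * asin(pmin(1, sqrt(a)))     # pmin guards against rounding above 1
}

dist_ft <- function(lat, lon) {
  idx <- seq_along(lat)
  # outer() passes whole index vectors, so the vectorised helper works directly
  m <- outer(idx, idx, function(i, j) haversine_ft(lat[i], lon[i], lat[j], lon[j]))
  as.dist(m)
}

# Small example using the first four rows of the question's sample
pts <- data.frame(
  LATITUDE = c(32.70132, 34.74251, 32.55205, 32.64144),
  LONGITUD = c(-85.52518, -86.88351, -87.34777, -85.35430)
)
d_ft <- dist_ft(pts$LATITUDE, pts$LONGITUD)
dend_ft <- hclust(d_ft, method = "complete")
groups_ft <- cutree(dend_ft, h = 600)     # the threshold is now in feet
```

These four sample points are many miles apart, so each ends up in its own group. Note also that a full pairwise matrix for 30,800 points has roughly 949 million entries, which takes several gigabytes as doubles, so this direct approach may need to be run on subsets of the data.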

We can plot them as a scatter plot to see how they look:

plot(theData, col=groups)

[Image: scatter plot of the points colored by cluster]

Promising: nearby points form clusters, which is what we wanted.

Let's add the cluster centers, and circles around those centers with a radius of 1 (so that the maximum distance within a circle is 2):

G1 <- tapply(theData[,1], groups, mean)  # means of groups
G2 <- tapply(theData[,2], groups, mean)  # ...

library(plotrix)  # for drawing circles
plot(theData, col=groups)
points(G1, G2, col=seq_along(G1), cex=2, pch=19)  # one color per cluster
for(i in seq_along(G1)) {  # draw circles
    draw.circle(G1[i], G2[i], 1, border=i,lty=3,lwd=3)
}

[Image: clusters with their centers and circles of radius 1]

It looks like drawing circles around the mean is not the best way to capture all of the points within a cluster. Nevertheless, it can be verified visually that the maximum distance between the points in one group is at most 2 (just try shifting the circles a bit to encapsulate all of the points).
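The question also asks for hotspots of 4 or more points and their centroids. A minimal sketch of that filtering step follows, using small made-up stand-ins for `theData` and `groups` (the names `sizes`, `hotspots`, `keep` and `centroids` are illustrative):

```r
# Sketch: keep clusters with at least 4 members and compute their centroids.
# The data below is made up; in practice use the real data frame and groups.
theData <- data.frame(
  LATITUDE = c(0, 0.1, 0.2, 0.1, 5, 5.1, 9),
  LONGITUD = c(0, 0.1, 0, 0.2, 5, 5, 9)
)
groups <- cutree(hclust(dist(theData), method = "complete"), h = 2)

sizes <- table(groups)
hotspots <- as.integer(names(sizes)[sizes >= 4])   # ids of clusters with >= 4 points

keep <- groups %in% hotspots
centroids <- aggregate(theData[keep, ], by = list(cluster = groups[keep]), FUN = mean)

# All points colored by cluster, with hotspot centroids marked by crosses
plot(theData$LONGITUD, theData$LATITUDE, col = groups,
     xlab = "Longitude", ylab = "Latitude")
points(centroids$LONGITUD, centroids$LATITUDE, pch = 4, cex = 2, lwd = 2)
```

Averaging latitude and longitude directly is a reasonable approximation for clusters only a few hundred feet across, which is the scale the question is about.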

Karolis Koncevičius answered Oct 08 '22 11:10