I am trying to use clustering in R. I am a rookie and haven't worked much with R.
I have geolocation points as latitude and longitude values. I want to find the hotspots in this data.
I am looking to create clusters of 4 or more points that are within 600 feet of each other.
I want to get the centroids of such clusters and plot them.
The data looks like this:
LATITUDE LONGITUD
32.70132 -85.52518
34.74251 -86.88351
32.55205 -87.34777
32.64144 -85.35430
34.92803 -87.81506
32.38016 -86.29790
32.42127 -87.08690
...
structure(list(LATITUDE = c(32.70132, 34.74251, 32.55205, 32.64144,
34.92803, 32.38016, 32.42127, 32.9095, 33.58092, 32.51617, 33.5726,
33.83251, 34.65639, 34.27694, 33.73851, 33.95132, 31.35445, 34.05263,
33.37959, 30.50248, 32.31561, 32.66919, 31.75039, 33.56986, 33.27091,
33.93598, 32.30964, 31.09773, 32.26711, 33.54263, 34.72014, 34.78548,
30.65705, 31.25939, 31.27647, 30.54322, 31.22416, 33.38549, 33.18338,
31.16811, 32.38368, 32.36253, 31.14464), LONGITUD = c(-85.52518,
-86.88351, -87.34777, -85.3543, -87.81506, -86.2979, -87.0869,
-85.75888, -86.27647, -86.21179, -86.65275, -87.2696, -85.72738,
-87.71489, -86.48934, -86.29693, -88.22943, -87.55328, -85.31454,
-87.79342, -86.88108, -86.26669, -88.04425, -86.44631, -87.74383,
-87.72403, -86.28067, -85.4449, -87.62541, -86.56251, -86.48971,
-85.59656, -88.24491, -86.60828, -86.18112, -88.22778, -85.63784,
-86.03297, -87.55456, -85.37719, -86.38047, -86.21579, -86.86606
)), .Names = c("LATITUDE", "LONGITUD"), class = "data.frame", row.names = c(NA,
-43L))
There are 30,800 entries (geo locations) in the above data frame. I have given a sample above.
I cannot use k-means because it requires the number of clusters to be specified up front, and I do not know that number here. Clusters should consist of 4 or more points that are within about 600 ft of each other.
As an initial step, I tried to plot all the latitude and longitude points to get an idea of what the data looks like, so that I can later check whether the plot of the clusters resembles this plot.
plot(dbfvar[,1], dbfvar[,2], type="l") #dbfvar is the dataframe having above data.
The plot was not satisfactory. It was not as expected.
The main part is to create the clusters and obtain the centroids of them, and visualize the centroids of the clusters formed.
P.S.: I am not confined to using R; I can use Python as well. I am looking for a good solution to the above problem before I go ahead and apply it to 7 such files (each with 30,800 geo locations).
Hierarchical clustering is one approach.
First you construct a dendrogram:
dend <- hclust(dist(theData), method="complete")
I am using "complete" linkage here, so that the groups are merged by the maximum-distance "rule". This is useful later because it ensures that all of the points in one group are at most a certain distance apart.
I chose a distance of "2" because I am not sure how to convert your latitudes and longitudes to feet. You should convert first and then choose 600 instead of 2. Here is the resulting dendrogram, cut at a height of "2":
plot(dend, hang=-1)
points(c(-100,100), c(2,2), col="red", type="l", lty=2)
Now each subtree intersected by the red line will become one cluster.
groups <- cutree(dend, h=2) # cutree takes the hclust tree, not the data; change "h" to 600 after converting to feet.
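The conversion to feet mentioned above can be sketched in base R with the haversine formula. This is only a sketch under assumptions: `haversine_ft` is a hypothetical helper (not from any package), the Earth is treated as a sphere with the mean radius 6371008.8 m, and the four points are illustrative (three made-up points near the first row of the sample, plus the second row). A dense distance matrix for all 30,800 points needs roughly 7.6 GB per full matrix, so this form is realistic only for subsets.

```r
# Hypothetical helper: pairwise great-circle distances in feet (haversine).
haversine_ft <- function(lat, lon, radius_m = 6371008.8) {
  radius_ft <- radius_m / 0.3048          # metres to international feet
  phi <- lat * pi / 180                   # latitude in radians
  lam <- lon * pi / 180                   # longitude in radians
  dphi <- outer(phi, phi, "-")
  dlam <- outer(lam, lam, "-")
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlam / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_ft * asin(pmin(sqrt(a), 1))
}

# Illustrative points: three close together, one far away.
pts <- data.frame(
  LATITUDE = c(32.70132, 32.70182, 32.70132, 34.74251),
  LONGITUD = c(-85.52518, -85.52518, -85.52468, -86.88351)
)

d_ft      <- haversine_ft(pts$LATITUDE, pts$LONGITUD)
dend_ft   <- hclust(as.dist(d_ft), method = "complete")
groups_ft <- cutree(dend_ft, h = 600)     # cut height is now in feet
```

With distances in feet, the cut height `h = 600` has the meaning the question asks for.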
We can plot them as a scatter plot to see how they look:
plot(theData, col=groups)
Promising. Nearby points form clusters, which is what we wanted.
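The claim about complete linkage can also be checked numerically rather than by eye. The sketch below uses made-up random points (`xy` is not your data) and confirms that, after cutting at height 2, the largest pairwise distance inside every group is no more than 2.

```r
set.seed(1)
xy <- matrix(runif(60, 0, 10), ncol = 2)   # 30 illustrative 2-D points

d   <- dist(xy)                            # Euclidean distances
grp <- cutree(hclust(d, method = "complete"), h = 2)

# Largest within-group distance (the "diameter") for each group
dm   <- as.matrix(d)
diam <- sapply(split(seq_len(nrow(xy)), grp), function(i) max(dm[i, i]))

all(diam <= 2)
```

With other linkage methods such as "single" this bound is not guaranteed, which is why "complete" was chosen here.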
Let's add centers and circles around those centers with the radius of 1 (so that the max distance within the circle is 2):
G1 <- tapply(theData[,1], groups, mean) # means of groups
G2 <- tapply(theData[,2], groups, mean) # ...
library(plotrix) # for drawing circles
plot(theData, col=groups)
points(G1, G2, col=seq_along(G1), cex=2, pch=19) # one colour per group centre
for(i in seq_along(G1)) { # draw circles
draw.circle(G1[i], G2[i], 1, border=i,lty=3,lwd=3)
}
It looks like drawing circles around the mean is not the best way to capture all of the points within a cluster. Nevertheless, it can be verified visually that the maximum distance between the points in one group is 2 (just try shifting the circles a bit to encapsulate all of the points).
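The question also asks for clusters of 4 or more points, which the steps above do not enforce. One possible way to handle that, sketched here with made-up points and a stand-in `groups` vector in place of the real `cutree` output, is to drop the small groups and average the coordinates of the rest. Averaging latitude and longitude directly is an assumption that is reasonable only because each cluster spans a few hundred feet.

```r
# Illustrative points and a stand-in for the cutree() result
pts <- data.frame(
  LATITUDE = c(32.7010, 32.7012, 32.7014, 32.7016, 33.5809, 33.5811, 34.7425),
  LONGITUD = c(-85.5250, -85.5252, -85.5254, -85.5256, -86.2765, -86.2767, -86.8835)
)
groups <- c(1, 1, 1, 1, 2, 2, 3)

# Keep only groups with at least 4 members
sizes  <- table(groups)
keep   <- as.integer(names(sizes)[sizes >= 4])
in_hot <- groups %in% keep

# Centroid (mean latitude and longitude) of each remaining group
centroids <- aggregate(pts[in_hot, ],
                       by = list(cluster = groups[in_hot]),
                       FUN = mean)
```

To visualize the result, plot the points with longitude on the x axis, for example plot(pts$LONGITUD, pts$LATITUDE), and overlay the centroids with points(centroids$LONGITUD, centroids$LATITUDE, pch=19, col="red").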