I have a data frame (df) that I obtained after running k-means clustering on my original dataset. It has 4 clusters, and I would like to know how much the 4 variables (V1 to V4) vary within each cluster. In other words, which variation in those 4 variables is causing the clusters to be separated?
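library(RColorBrewer)  # brewer.pal()
library(scales)        # alpha(); assumed source, ggplot2 exports it too
# Cluster into 4 groups; kmeans() needs numeric columns only (drop ID first)
fit <- kmeans(df, 4, iter.max=1000, nstart=25)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(df, col=fit$cluster, pch=16)
# Mean of each variable within each cluster
aggregate(df, by=list(fit$cluster), FUN=mean)
clust.out <- fit$cluster
# Append the cluster labels; the new column is named fit.cluster
df1 <- data.frame(df, fit$cluster)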
Here is my df1 after k-means:
+-------+-------+-------+--------+--------+-------------+
| ID | V1 | V2 | V3 | V4 | fit.cluster |
+-------+-------+-------+--------+--------+-------------+
| DJ123 | 0.5 | 0.7 | -0.4 | -0.1 | 1 |
| DJ123 | 0.46 | 0.68 | -0.39 | -0.09 | 1 |
| DJ123 | 0.77 | 0.9 | -0.4 | -0.4 | 2 |
| DJ123 | 11.23 | 11.11 | -11.21 | -11.21 | 4 |
| DJ123 | 1.5 | 1.7 | -1.4 | -5.1 | 3 |
| DJ123 | 0.76 | 0.9 | -0.4 | -0.4 | 2 |
| DJ123 | 1.5 | 2.7 | -1.4 | -4.1 | 3 |
+-------+-------+-------+--------+--------+-------------+
Could you please provide sample code to get the summary statistics within each cluster? I hope my question is clear.
You can use ddply from plyr to do this easily.
library(plyr)
ddply(df1,.(fit.cluster),summarise,variance1 = var(V1),variance2 = var(V2),mean1 = mean(V1),...)
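You can also compute every variance and mean in one pass:
ddply(df1,.(fit.cluster),function(x){
  v = x[paste0('V',1:4)]  # keep only V1..V4; drops ID and the cluster column
  res = c(as.numeric(colwise(var)(v)),as.numeric(colwise(mean)(v)))
  names(res) = paste0(rep(c('Var','Mean'),each = 4),rep(1:4,2))
  res
})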
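If you would rather not depend on plyr, the same per-cluster summary can be sketched in base R with aggregate(). This is only a minimal illustration: the data frame below is typed in from the sample table in the question (ID column left out), and the names vars, clust_mean and clust_var are made up for the example.

```r
# Sample rows copied from the question's table (ID column omitted).
df1 <- data.frame(
  V1 = c(0.5, 0.46, 0.77, 11.23, 1.5, 0.76, 1.5),
  V2 = c(0.7, 0.68, 0.9, 11.11, 1.7, 0.9, 2.7),
  V3 = c(-0.4, -0.39, -0.4, -11.21, -1.4, -0.4, -1.4),
  V4 = c(-0.1, -0.09, -0.4, -11.21, -5.1, -0.4, -4.1),
  fit.cluster = c(1, 1, 2, 4, 3, 2, 3)
)

# Columns to summarise (hypothetical helper name).
vars <- paste0("V", 1:4)

# Mean and variance of each variable within each cluster.
clust_mean <- aggregate(df1[vars], by = list(cluster = df1$fit.cluster), FUN = mean)
clust_var  <- aggregate(df1[vars], by = list(cluster = df1$fit.cluster), FUN = var)

print(clust_mean)
print(clust_var)  # a cluster with a single row gets NA variance
```

Comparing the rows of clust_mean shows which variables differ most between clusters, while clust_var shows how tight each cluster is on each variable. On the real data you would use your own df1 instead of the typed-in sample.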