I am using dplyr and broom to compute kmeans for my data. My data contains a test and training set of X and Y coordinates and are grouped by a some parameter value (lambda in this case):
mds.test = data.frame()
for(l in seq(0.1, 0.9, by=0.2)) {
new.dist <- run.distance.model(x, y, lambda=l)
mds <- preform.mds(new.dist, ndim=2)
mds.test <- rbind(mds.test, cbind(mds$space, design[,c(1,3,4,5)], lambda=rep(l, nrow(mds$space)), data="test"))
}
> head(mds.test)
Comp1 Comp2 Transcripts Genes Timepoint Run lambda data
7A_0_AAGCCTAGCGAC -0.06690476 -0.25519106 68125 9324 Day 0 7A 0.1 test
7A_0_AAATGACTGGCC -0.15292848 0.04310200 28443 6746 Day 0 7A 0.1 test
7A_0_CATCTCGTTCTA -0.12529445 0.13022908 27360 6318 Day 0 7A 0.1 test
7A_0_ACCGGCACATTC -0.33015913 0.14647857 23038 5709 Day 0 7A 0.1 test
7A_0_TATGTCGGAATG -0.25826098 0.05424976 22414 5878 Day 0 7A 0.1 test
7A_0_GAAAAAGGTGAT -0.24349387 0.08071162 21907 6766 Day 0 7A 0.1 test
I've head
the test dataset above but I also have one named mds.train
which contains my training data coordinates. My ultimate goal here is to run k-means for both sets grouped by lambda, then compute the within.ss, between.ss and total.ss for the test data on the training centers. Thanks to a great resource on broom, I am able to run kmeans for each lambda for the test set by simply doing the following:
test.kclusts = mds.test %>%
group_by(lambda) %>%
do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
Then I can compute the centers of this data for each cluster within each lambda:
test.clusters = test.kclusts %>%
group_by(lambda) %>%
do(tidy(.$kclust[[1]]))
This is where I am stuck. How do I compute the feature assignments as similarly shown on the reference page (e.g. kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], points.matrix))
), where my points.matrix
is mds.test
which is a data.frame with length(unique(mds.test$lambda))
times as many rows as should be? And is there a way to somehow use centers from the training set to compute glance()
statistics based off the test assignments?
Any help would be greatly appreciated! Thank you!
EDIT: Updating progress. I have figured out how to aggregate the test/training assignments but am still having issues trying to compute kmeans statistics from both sets (training assignments on test center and test assignments on training centers). Updated code is below:
test.kclusts = mds.test %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
test.clusters = test.kclusts %>% group_by(lambda) %>% do(tidy(.$kclust[[1]]))
test.clusterings = test.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
test.assignments = left_join(test.kclusts, mds.test) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))
train.kclusts = mds.train %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
train.clusters = train.kclusts %>% group_by(lambda) %>% do(tidy(.$kclust[[1]]))
train.clusterings = train.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
train.assignments = left_join(train.kclusts, mds.train) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))
test.assignments$data = "test"
train.assignments$data = "train"
merge.assignments = rbind(test.assignments, train.assignments)
merge.assignments %>% filter(., data=='test') %>% group_by(lambda) ... ?
Ive attached a plot below which illustrates my progress to this point. Just to reiterate, I would like to compute kmeans statistics (within sum of square, total sum of squares, and between sum of squares) for the training data centers on test assignments/coordinates (the plots which the centers look off):
One approach would be to...
See below using a simpler dataset pulled from clustering example from tidymodels.
library(tidyverse)
library(rsample)
library(broom)
library(fuzzyjoin)
# data and train / test set-up
set.seed(27)
centers <- tibble(
cluster = factor(1:3),
num_points = c(100, 150, 50), # number points in each cluster
x1 = c(5, 0, -3), # x1 coordinate of cluster center
x2 = c(-1, 1, -2) # x2 coordinate of cluster center
)
labelled_points <-
centers %>%
mutate(
x1 = map2(num_points, x1, rnorm),
x2 = map2(num_points, x2, rnorm)
) %>%
select(-num_points) %>%
unnest(cols = c(x1, x2))
points <-
labelled_points %>%
select(-cluster)
set.seed(1234)
split <- rsample::initial_split(points)
train <- rsample::training(split)
test <- rsample::testing(split)
# Fit kmeans on train then assign clusters to test
kclust <- kmeans(train, centers = 3)
clust_centers <- kclust %>%
tidy() %>%
select(-c(size, withinss))
test_clusts <- fuzzyjoin::distance_join(mutate(test, index = row_number()),
clust_centers,
max_dist = Inf,
method = "euclidean",
distance_col = "dist") %>%
group_by(index) %>%
filter(dist == min(dist)) %>%
ungroup()
#> Joining by: c("x1", "x2")
# resulting table
test_clusts
#> # A tibble: 75 x 7
#> x1.x x2.x index x1.y x2.y cluster dist
#> <dbl> <dbl> <int> <dbl> <dbl> <fct> <dbl>
#> 1 4.24 -0.946 1 5.07 -1.10 3 0.847
#> 2 3.54 0.287 2 5.07 -1.10 3 2.06
#> 3 3.71 -1.67 3 5.07 -1.10 3 1.47
#> 4 5.03 -0.788 4 5.07 -1.10 3 0.317
#> 5 6.57 -2.49 5 5.07 -1.10 3 2.04
#> 6 4.97 0.233 6 5.07 -1.10 3 1.34
#> 7 4.43 -1.89 7 5.07 -1.10 3 1.01
#> 8 5.34 -0.0705 8 5.07 -1.10 3 1.07
#> 9 4.60 0.196 9 5.07 -1.10 3 1.38
#> 10 5.68 -1.55 10 5.07 -1.10 3 0.758
#> # ... with 65 more rows
# calc within clusts SS on test
test_clusts %>%
group_by(cluster) %>%
summarise(size = n(),
withinss = sum(dist^2),
withinss_avg = withinss / size)
#> # A tibble: 3 x 4
#> cluster size withinss withinss_avg
#> <fct> <int> <dbl> <dbl>
#> 1 1 11 32.7 2.97
#> 2 2 35 78.9 2.26
#> 3 3 29 62.0 2.14
# compare to on train
tidy(kclust) %>%
mutate(withinss_avg = withinss / size)
#> # A tibble: 3 x 6
#> x1 x2 size withinss cluster withinss_avg
#> <dbl> <dbl> <int> <dbl> <fct> <dbl>
#> 1 -3.22 -1.91 40 76.8 1 1.92
#> 2 0.0993 1.06 113 220. 2 1.95
#> 3 5.07 -1.10 72 182. 3 2.53
# plot of test and train points
test_clusts %>%
select(x1 = x1.x, x2 = x2.x, cluster) %>%
mutate(type = "test") %>%
bind_rows(
augment(kclust, train) %>%
mutate(type = "train") %>%
rename(cluster = .cluster)
) %>%
ggplot(aes(x = x1,
y = x2,
color = as.factor(cluster)))+
geom_point()+
facet_wrap(~fct_rev(as.factor(type)))+
coord_fixed()+
labs(title = "Cluster Assignment on Training and Holdout Datasets",
color = "Cluster")+
theme_bw()
Created on 2021-08-19 by the reprex package (v2.0.0)
(See comment on OP for link to conversations on making this easier within tidymodels.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With