Simple approach to assigning clusters for new data after k-means clustering

Tags: r, k-means

I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 as the testing set; I want to cluster on the training set and assign each test point to the correct cluster.

I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):

df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)

closest.cluster <- function(x) {
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)

However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer if I could assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?

asked Dec 16 '13 by josliber

People also ask

How does k-means assign clusters?

The working of the K-means algorithm is explained in the steps below. Step 1: Select the number K to decide the number of clusters. Step 2: Select K random points as the initial centroids (these need not come from the input dataset). Step 3: Assign each data point to its closest centroid, which forms the predefined K clusters.
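As a minimal sketch of the assignment step (Step 3) in base R; the objects pts and centers below are illustrative placeholders, not taken from the question:

# Illustrative data: 10 points and 3 candidate centroids in 2-D
pts <- matrix(runif(20), ncol = 2)
centers <- matrix(runif(6), ncol = 2)

# For each point, compute the squared Euclidean distance to every centroid
# and keep the index of the nearest one (the "assign to closest centroid" step)
assignments <- apply(pts, 1, function(p) which.min(colSums((t(centers) - p)^2)))
assignments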

How do you assign new data to an existing clustering?

Short answer: build a classifier that treats the labels you assigned during clustering as classes. When new points arrive, use the classifier trained on the originally clustered data to predict which class (i.e. which cluster) each new point belongs to.
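A possible realisation of this idea in R, reusing df1, df2, and km from the question; class::knn is just one classifier choice here, named as an assumption rather than the only option:

library(class)

# The cluster labels from the training fit act as class labels
train_labels <- factor(km$cluster)

# Classify each test point by its nearest labelled training point (1-NN)
pred <- knn(train = df1, test = df2, cl = train_labels, k = 1)
head(pred)

Note that a 1-NN classifier assigns by the nearest training point rather than the nearest centroid, so its assignments can differ slightly from the direct nearest-centroid rule near cluster boundaries.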

Are there strategies we can use to get better clusters using k-means?

The k-means clustering algorithm can be improved significantly by using a better initialization technique and by repeating (re-starting) the algorithm. Even when the data has overlapping clusters, k-means can still improve on the result of the initialization step.
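In base R, "re-starting" corresponds to the nstart argument of stats::kmeans: the algorithm is run from nstart different random initialisations and the best fit is kept. A small illustration on made-up data:

set.seed(1)
df <- data.frame(x = runif(100), y = runif(100))   # illustrative data

km_single  <- kmeans(df, centers = 3, nstart = 1)
km_restart <- kmeans(df, centers = 3, nstart = 25)

km_single$tot.withinss
km_restart$tot.withinss   # usually no worse (and often lower) than the single start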

What are the two main approaches to clustering?

Clustering itself can be categorized into two types, viz. hard clustering and soft clustering. In hard clustering, a data point can belong to exactly one cluster. In soft clustering, the output is instead a probability or likelihood of the data point belonging to each of the predefined number of clusters.
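A short sketch contrasting hard and soft assignments on the same data; e1071::cmeans (fuzzy c-means) is an assumed package choice for the soft case, not something used elsewhere on this page:

library(e1071)

set.seed(1)
df <- data.frame(x = runif(100), y = runif(100))   # illustrative data

hard <- kmeans(df, centers = 3)              # hard: one cluster label per point
soft <- cmeans(as.matrix(df), centers = 3)   # soft: one membership weight per cluster

head(hard$cluster)
head(round(soft$membership, 2))              # each row sums to 1 across the 3 clusters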


2 Answers

You could use the flexclust package, which implements a predict method for k-means:

library("flexclust") data("Nclus")  set.seed(1) dat <- as.data.frame(Nclus) ind <- sample(nrow(dat), 50)  dat[["train"]] <- TRUE dat[["train"]][ind] <- FALSE  cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans")) cl1     # # call: # kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4) # # cluster sizes: # #  1   2   3   4  #130 181  98  91   pred_train <- predict(cl1) pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])  image(cl1) points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3) points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange") 

flexclust plot

There are also methods for converting the results of clustering functions like stats::kmeans or cluster::pam to objects of class kcca, and vice versa:

as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
#  1  2
# 50 50
answered Sep 22 '22 by rcs


Something I noticed about both the approach in the question and the flexclust approach is that they are rather slow (benchmarked here on a training set and a testing set with 1 million observations and 2 features each).

Fitting the original model is reasonably fast:

set.seed(144)
df1 <- data.frame(x=runif(1e6), y=runif(1e6))
df2 <- data.frame(x=runif(1e6), y=runif(1e6))
system.time(km <- kmeans(df1, centers=3))
#    user  system elapsed
#   1.204   0.077   1.295

The solution I posted in the question is slow at calculating the test-set cluster assignments, since it separately calls closest.cluster for each test-set point:

system.time(pred.test <- apply(df2, 1, closest.cluster))
#    user  system elapsed
#  42.064   0.251  42.586

Meanwhile, the flexclust package seems to add a lot of overhead regardless of whether we convert the fitted model with as.kcca or fit a new one ourselves with kcca (though the prediction at the end is much faster):

# APPROACH #1: Convert from the kmeans() output
system.time(km.flexclust <- as.kcca(km, data=df1))
#    user  system elapsed
#  87.562   1.216  89.495
system.time(pred.flexclust <- predict(km.flexclust, newdata=df2))
#    user  system elapsed
#   0.182   0.065   0.250

# APPROACH #2: Fit the k-means clustering model in the flexclust package
system.time(km.flexclust2 <- kcca(df1, k=3, kccaFamily("kmeans")))
#    user  system elapsed
# 125.193   7.182 133.519
system.time(pred.flexclust2 <- predict(km.flexclust2, newdata=df2))
#    user  system elapsed
#   0.198   0.084   0.302

It seems that there is another sensible approach here: using a fast k-nearest neighbors solution like a k-d tree to find the nearest neighbor of each test-set observation within the set of cluster centroids. This can be written compactly and is relatively speedy:

library(FNN)
system.time(pred.knn <- get.knnx(km$centers, df2, 1)$nn.index[,1])
#    user  system elapsed
#   0.315   0.013   0.345
all(pred.test == pred.knn)
# [1] TRUE
answered Sep 21 '22 by josliber