
What is the most efficient kmeans clustering package in R?

Tags:

r

Sorry if this seems vague, but I have a data set with over 100 columns of characteristics I want to cluster on, and ~10^6 rows. Using

kmeans(dataframe, centers = 100,
       nstart = 20,
       iter.max = 30)

takes over an hour on an i7-6700K. It does not use multiple cores; is making it use multiple cores (or otherwise speeding it up) something that can be done?

Thanks!

Asked Nov 10 '17 by Jack Arnestad

People also ask

What package is Kmeans in R?

The R function kmeans() [stats package] can be used to compute the k-means algorithm. The simplified format is kmeans(x, centers), where x is the data and centers is the number of clusters to be produced.
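For instance, a minimal call on R's built-in iris data (a sketch; centers = 3 is chosen to match the three species):

# Minimal kmeans() call: 3 clusters on the four numeric iris columns
fit <- kmeans(iris[, 1:4], centers = 3)
table(fit$cluster, iris$Species)  # compare clusters against the known species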

How do you get the best K in Kmeans?

A popular method known as the elbow method is used to determine the optimal value of K for the k-means clustering algorithm. The basic idea is to plot the clustering cost for a range of values of K: as K increases, there are fewer elements in each cluster and the cost decreases, and the "elbow" where that decrease levels off marks a good choice of K.
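As an illustration, here is a minimal elbow-method sketch with base R's kmeans() on toy data (the data, the range of k, and nstart are assumptions for demonstration):

# Elbow method sketch: plot total within-cluster sum of squares vs. k
set.seed(42)
x <- matrix(rnorm(1000 * 5), ncol = 5)  # toy data: 1000 rows, 5 columns

wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 5)$tot.withinss
})

plot(1:10, wss, type = "b", xlab = "k",
     ylab = "Total within-cluster sum of squares")
# choose k near the "elbow", where the curve stops dropping steeply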

What is the best choice for number of clusters k?

According to the gap statistic method, k = 12 is also determined as the optimal number of clusters. We can visually compare k-means clusterings with k = 9 (optimal according to the elbow method) and k = 12 (optimal according to the silhouette and gap statistic methods).
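For reference, the gap statistic mentioned above can be computed with clusGap() from the cluster package; a minimal sketch on toy data (B = 50 bootstrap reference sets is an assumption to keep the run short):

library(cluster)  # provides clusGap() for the gap statistic

set.seed(42)
x <- matrix(rnorm(1000 * 5), ncol = 5)  # toy data: 1000 rows, 5 columns

# the suggested k maximizes the gap statistic; nstart is passed on to kmeans()
gap <- clusGap(x, FUNcluster = kmeans, nstart = 5, K.max = 10, B = 50)
plot(gap)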


1 Answer

You could try the ClusterR package, especially the function MiniBatchKmeans.

Here is an example of usage:

Some data (smaller than yours: 300,000 rows and 30 columns):

z <- rbind(replicate(30, rnorm(1e5, 2)),
           replicate(30, rnorm(1e5, -1)),
           replicate(30, rnorm(1e5, 5)))  # three Gaussian clusters, 3e5 x 30

library(ClusterR)

# mini-batch k-means fits on small random batches instead of the full data
km_model <- MiniBatchKmeans(z, clusters = 3, batch_size = 20, num_init = 5,
                            max_iters = 100, init_fraction = 0.2,
                            initializer = 'kmeans++', early_stop_iter = 10,
                            verbose = FALSE)

# assign each row to its nearest fitted centroid
pred <- predict_MBatchKMeans(z, km_model$centroids)

The object pred contains the cluster assignments:

table(pred)
pred
     1      2      3 
100000 100000 100000 

I'd say that is a perfect separation. If the function runs fast enough for you, it is advisable to increase the batch size and the number of initializations (num_init); a retuned call is sketched below.
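For example, a hypothetical retuned call (the larger batch_size and num_init values are illustrative, not benchmarked):

# Illustrative only: a larger batch size and more initializations
km_model2 <- MiniBatchKmeans(z, clusters = 3, batch_size = 100, num_init = 10,
                             max_iters = 100, init_fraction = 0.2,
                             initializer = 'kmeans++', early_stop_iter = 10,
                             verbose = FALSE)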

Speed:

library(microbenchmark)
microbenchmark(
  km_model <- MiniBatchKmeans(z, clusters = 3, batch_size = 20, num_init = 5,
                              max_iters = 100, init_fraction = 0.2,
                              initializer = 'kmeans++', early_stop_iter = 10,
                              verbose = F)
)

Unit: seconds
                                                                                                                                                                                     expr
 km_model <- MiniBatchKmeans(z, clusters = 3, batch_size = 20, num_init = 5, max_iters = 100, init_fraction = 0.2, initializer = "kmeans++",      early_stop_iter = 10, verbose = F)
      min       lq     mean   median       uq      max neval
 3.338328 3.366573 3.473403 3.444095 3.518813 4.176116   100
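For a rough comparison against base R's kmeans() on the same 300,000 x 30 matrix (a sketch; times = 1 because a single run is already expected to take much longer):

# Single-run timing of base kmeans() with comparable settings
microbenchmark(
  km_base <- kmeans(z, centers = 3, nstart = 5, iter.max = 100),
  times = 1
)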
Answered Sep 28 '22 by missuse