Sorry if this seems vague, but I have a data set with over 100 columns with characteristics I want to cluster with, and ~10^6 rows. Using
kmeans(dataframe, centers = 100,
nstart = 20,
iter.max = 30)
takes over an hour on an i7-6700K. It doesn't use multiple cores — is there a way to parallelize it?
Thanks!
The R function kmeans() [stats package] computes the k-means algorithm. The simplified format is kmeans(x, centers), where "x" is the data and centers is the number of clusters to produce.
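As a minimal sketch of that interface (toy data invented for illustration, not the poster's data set):

```r
# Two well-separated Gaussian blobs, 100 points each.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

# Fit k-means with 2 centers and 10 random restarts.
fit <- kmeans(x, centers = 2, nstart = 10)

fit$size     # number of points assigned to each cluster
fit$centers  # the fitted centroids
```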
A popular way to choose K is the elbow method: plot the clustering cost (total within-cluster sum of squares) for a range of K values. As K increases, each cluster contains fewer points and the cost falls; the "elbow" where the decrease levels off suggests a reasonable K.
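The elbow method can be sketched like this (again on small synthetic data; tot.withinss is the cost kmeans() reports):

```r
# Elbow method: total within-cluster SS for k = 1..10.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

wss <- sapply(1:10, function(k)
  kmeans(x, centers = k, nstart = 5)$tot.withinss)

# Look for the "elbow" where the curve flattens out.
plot(1:10, wss, type = "b",
     xlab = "k", ylab = "Total within-cluster SS")
```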
You could try the ClusterR package, especially its function MiniBatchKmeans.
Here is an example of usage:
# some data (smaller than yours: 300k rows and 30 columns)
z <- rbind(replicate(30, rnorm(1e5, 2)),
replicate(30, rnorm(1e5, -1)),
replicate(30, rnorm(1e5, 5)))
library(ClusterR)
km_model <- MiniBatchKmeans(z, clusters = 3, batch_size = 20, num_init = 5, max_iters = 100,
init_fraction = 0.2, initializer = 'kmeans++', early_stop_iter = 10,
verbose = F)
pred <- predict_MBatchKMeans(z, km_model$centroids)
The object pred contains the associated clusters:
table(pred)
pred
1 2 3
100000 100000 100000
I'd say that's a perfect separation. If the function runs fast enough for you, it's advisable to increase the batch size and the number of initializations (num_init).
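On the original multicore question: base kmeans() itself is single-threaded, but since nstart just runs independent restarts, one workaround is to run single-start fits on several cores with the parallel package and keep the best by tot.withinss. A sketch, with a small invented matrix standing in for the real data:

```r
library(parallel)

# Stand-in data: 1000 rows x 30 columns.
set.seed(1)
x <- matrix(rnorm(3e4), ncol = 30)

# Replicate kmeans(..., nstart = 20) as 20 single-start fits across cores.
cl <- makeCluster(max(1, detectCores() - 1))
clusterSetRNGStream(cl, 123)          # reproducible parallel RNG streams
clusterExport(cl, "x")
fits <- parLapply(cl, 1:20, function(i)
  kmeans(x, centers = 5, nstart = 1, iter.max = 30))
stopCluster(cl)

# Keep the fit with the lowest total within-cluster SS.
best <- fits[[which.min(sapply(fits, `[[`, "tot.withinss"))]]
```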
Speed:
library(microbenchmark)
microbenchmark(km_model <- MiniBatchKmeans(z, clusters = 3, batch_size = 20, num_init = 5, max_iters = 100,
init_fraction = 0.2, initializer = 'kmeans++', early_stop_iter = 10,
verbose = F))
Unit: seconds
expr
km_model <- MiniBatchKmeans(z, clusters = 3, batch_size = 20, num_init = 5, max_iters = 100, init_fraction = 0.2, initializer = "kmeans++", early_stop_iter = 10, verbose = F)
min lq mean median uq max neval
3.338328 3.366573 3.473403 3.444095 3.518813 4.176116 100