I am a beginner trying to cluster a data frame (50,000 records, 2 features: x and y) with the mclust package. However, a single command such as Mclust(XXX.df) or densityMclust(XXX.df) seems to run forever.
Is there any way to make the command run faster? Example code would be helpful.
For reference, I'm using a 4-core processor with 6 GB of RAM. The same clustering analysis took about 15 minutes in Weka, whereas in R the process is still running after more than 1.5 hours. I would really like to use R for this analysis.
Dealing with large datasets in mclust is described in the Technical Report, subsection 11.1.
Briefly, the functions Mclust and mclustBIC include a provision for using a subsample of the data in the hierarchical clustering phase before applying EM to the full data set, in order to extend the method to larger datasets.
Generic example:
library(mclust)
set.seed(1)
##
## Data generation
##
N <- 5e3
df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))
##
## Full set
##
system.time(res <- Mclust(df))
# > user system elapsed
# > 66.432 0.124 67.439
##
## Subset for initial stage
##
M <- 1e3
system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
# > user system elapsed
# > 19.513 0.020 19.546
The subset version runs approximately 3.5 times faster on my dual-core machine (although Mclust uses only a single core).
With N <- 5e4 (as in your example) and M <- 1e3, the subset version took about 3.5 minutes to complete.
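If that is still too slow, the subset initialization can be combined with a narrower model search. By default Mclust tries 1 to 9 components across all available covariance models, so limiting the G and modelNames arguments means fewer EM fits. The sketch below is only an illustration: the range G = 1:5 and the two model names are assumptions chosen for the example, not recommendations for your data, and N is kept at 5e3 so that it runs quickly.

```r
library(mclust)
set.seed(1)

# Synthetic data, generated the same way as in the example above
N <- 5e3
df <- data.frame(x = rnorm(N) + ifelse(runif(N) > 0.5, 5, 0),
                 y = rnorm(N, 10, 5))

# Random subsample used only for the hierarchical initialization stage
M <- 1e3
idx <- sample(nrow(df), size = M)

# Assumption: at most 5 components and two covariance structures
# ("EEI" = diagonal, equal volume and shape; "VVV" = unconstrained)
# are enough for this data. Adjust both to what is plausible for yours.
fit <- Mclust(df,
              G = 1:5,
              modelNames = c("EEI", "VVV"),
              initialization = list(subset = idx))

summary(fit)
```

The trade-off is that a model outside the restricted set can no longer be selected, so it is worth checking the BIC values on a smaller sample before narrowing the search on the full data.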