
How to run mclust faster on a dataset with 50,000 records

Tags:

r

I am a beginner trying to cluster a data frame with 50,000 records and 2 features (x, y) using the mclust package. However, a single command (e.g. Mclust(XXX.df) or densityMclust(XXX.df)) seems to take forever to run.

Is there any way to make these commands run faster? Example code would be helpful.

For your information, I'm using a 4-core processor with 6 GB of RAM. The same clustering analysis took about 15 minutes in Weka, whereas in R the process is still running after more than 1.5 hours. I would really like to use R for this analysis.

user1389582 asked Dec 12 '22 18:12



1 Answer

How to deal with large datasets in mclust is described in the mclust technical report, subsection 11.1.

Briefly, functions Mclust and mclustBIC include a provision for using a subsample of the data in the hierarchical clustering phase before applying EM to the full data set, in order to extend the method to larger datasets.

Generic example:

library(mclust)
set.seed(1)
##
## Data generation
##
N  <- 5e3
df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))
##
## Full set
##
system.time(res <- Mclust(df))
# >   user  system elapsed 
# > 66.432   0.124  67.439 
##
## Subset for initial stage
##
M <- 1e3
system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
# >   user  system elapsed 
# > 19.513   0.020  19.546

The "subsetted" version runs approximately 3.5 times faster on my dual-core machine (Mclust uses only a single core).

With N <- 5e4 (as in your case) and M <- 1e3, the version with the subset took about 3.5 minutes to complete.
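To apply the same idea to your own data frame, the subsampling step can be wrapped in a small helper. This is only a sketch: fit_with_subset is a hypothetical name, not part of mclust, and restricting the number of components with G = 1:4 is my assumption to cut the run time further, not something your data necessarily supports. N is kept small here so the example finishes quickly.

```r
library(mclust)

# Hypothetical helper (not part of mclust): fit Mclust while using a random
# subsample of M rows for the hierarchical-clustering initialization only.
# EM is still run on the full data set.
fit_with_subset <- function(data, M = 1000, ...) {
  M   <- min(M, nrow(data))                      # never ask for more rows than exist
  idx <- sample(seq_len(nrow(data)), size = M)   # rows used for initialization
  Mclust(data, initialization = list(subset = idx), ...)
}

set.seed(1)
N  <- 2000   # small on purpose; the question has 5e4 rows
xy <- data.frame(x = rnorm(N) + ifelse(runif(N) > 0.5, 5, 0),
                 y = rnorm(N, 10, 5))

# G = 1:4 is an assumption: fewer candidate component counts means fewer models to fit
fit <- fit_with_subset(xy, M = 200, G = 1:4)
summary(fit)
```

For the real data you would call something like fit_with_subset(XXX.df, M = 1000). A larger M gives the initialization more data to work with but makes that stage slower, so it is worth trying a few values.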

redmode answered Jan 06 '23 22:01
