I'm using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is great and seems to generally find very good clusters for my data.
The problem is that I don't really know R at all, and while I have managed to muddle through the clustering process based on the help() contents and the extensive readme, I cannot for the life of me figure out how to write out the actual cluster results to file. I am using the following absurdly simple script to perform the clustering,
myData <- read.csv("data.csv", sep=",", header=FALSE)
attach(myData)
myBIC <- mclustBIC(myData)
mySummary <- summary( myBIC, data=myData )
at which point I have cluster results and a summary. The data in data.csv is just a list of multi-dimensional points, one per line. So each line looks like 'x,y,z' (in the case of 3 dimensions).
If I use 2d points (e.g. just the x and y vals) I can then use the internal plot function to get a very pretty graph that plots the original points and color codes each point based on the cluster it was assigned to. So I know all the info is somewhere in 'myBIC', but the docs and help don't seem to provide any insight as to how to print out this data!
I want to print out a new file based on the results I believe are encoded in myBIC. Something like,
CLUST x, y, z
1 1.2, 3.4, 5.2
1 1.2, 3.3, 5.2
2 5.5, 1.3, 1.3
3 7.1, 1.2, -1.0
3 7.2, 1.2, -1.1
and then - hopefully - also print out the parameters/centroids of the individual gaussians/clusters that the clustering process found.
Surely this is an absurdly easy thing to do and I'm just too ignorant of R to figure it out...
EDIT: I seem to have gotten a little bit further along. Doing the following prints out a somewhat cryptic matrix,
> mySummary$classification
[1] 1 1 2 1 3
[6] 1 1 1 3 1
[12] 1 2 1 3 1
[18] 1 3
which upon reflection I realized is actually the list of samples and their classifications. I guess it is not possible to write this directly via the write command, but a bit more experimentation in the R console led me to realize that I can do this:
> newData <- mySummary$classification
> write.csv( newData, file="class.csv" )
and that the result actually looks pretty nice!
$ head class.csv
"","x"
"1",1
"2",2
"3",2
where the first column apparently matches the index for the input data, and the second column describes the assigned class identity.
The 'mySummary$parameters' object appears to be nested though, and has a bunch of sub-objects corresponding to the individual gaussians and their parameters, etc. The 'write' function fails when I try to just write it out, but individually writing out each sub object name is a bit tedious. Which leads me to a new question: how do I iterate over a nested object in R and print the elements out in a serial fashion to a file descriptor?
I have this 'mySummary$parameters' object. It is composed of several sub-objects like 'mySummary$parameters$variance$sigma', etc. I would like to just iterate over everything and print it all to file in the same way that this is done to the CLI automatically...
To calculate the actual clustering parameters themselves (mean, variance, what cluster each point belongs to), you need to use Mclust.
To do the writing you can use (for example) write.csv.
By default Mclust calculates the parameters based on the optimal model as determined by BIC, so if that's what you want to do, you can do:
myMclust <- Mclust(myData)
Then myMclust$BIC will contain the results for all the other models (i.e. myMclust$BIC is more-or-less the same as mclustBIC(myData)).
See the Value: section of ?Mclust to see what other information myMclust has. For example, myMclust$parameters$mean is the mean for each cluster, myMclust$parameters$variance the variance for each cluster, ...
myMclust$classification, meanwhile, contains the cluster each point belongs to, calculated for the optimal model.
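For the centroids the question asks about, here is a minimal sketch of writing the cluster means to a file. It assumes mclust is installed (the code is skipped otherwise) and that myMclust$parameters$mean holds one row per dimension and one column per cluster, so it is transposed to get one row per cluster. The two-blob example data and the centroids.csv filename are made up for illustration.

```r
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  set.seed(1)

  # Made-up example data: two well-separated blobs in three dimensions.
  myData <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5)),
                       y = c(rnorm(50, 0), rnorm(50, 5)),
                       z = c(rnorm(50, 0), rnorm(50, 5)))

  myMclust <- Mclust(myData)

  # Transpose the means so each row is one cluster, then label the rows.
  centroids <- data.frame(CLUST = seq_len(myMclust$G),
                          t(myMclust$parameters$mean))

  centroid_file <- file.path(tempdir(), "centroids.csv")  # hypothetical filename
  write.csv(centroids, file = centroid_file, row.names = FALSE, quote = FALSE)
}
```

The file goes to tempdir() only to keep the sketch from cluttering the working directory; use whatever path you like.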
So, to get the output you want, you can do:
# create some data for example purposes -- you have your read.csv(...) instead.
myData <- data.frame(x=runif(100),y=runif(100),z=runif(100))
# get parameters for most optimal model
myMclust <- Mclust(myData)
# if you wanted to do your summary like before:
mySummary <- summary( myMclust$BIC, data=myData )
# add a column in myData CLUST with the cluster.
myData$CLUST <- myMclust$classification
# now to write it out:
write.csv(myData[,c("CLUST","x","y","z")], # reorder columns to put CLUST first
          file="out.csv",                  # output filename
          row.names=FALSE,                 # don't save the row numbers
          quote=FALSE)                     # don't wrap strings (here, the column names) in ""
A note on write.csv: if you don't put in row.names=FALSE you'll get an extra column in your csv containing the row number. Also, quote=FALSE puts your column headings as CLUST,x,y,z whereas otherwise they'd be "CLUST","x","y","z". It's your choice.
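To see the difference those two arguments make, here is a small base-R sketch that writes the same data frame both ways and reads back the header line. The three rows are taken from the sample output in the question; the filenames plain.csv and default.csv are just placeholders.

```r
# Three rows borrowed from the question's desired output.
df <- data.frame(CLUST = c(1L, 1L, 2L),
                 x = c(1.2, 1.2, 5.5),
                 y = c(3.4, 3.3, 1.3),
                 z = c(5.2, 5.2, 1.3))

plain <- file.path(tempdir(), "plain.csv")    # placeholder filename
dflt  <- file.path(tempdir(), "default.csv")  # placeholder filename

# With the two arguments set as in the answer.
write.csv(df, file = plain, row.names = FALSE, quote = FALSE)

# With the defaults (row.names = TRUE, quote = TRUE).
write.csv(df, file = dflt)

readLines(plain)[1]  # CLUST,x,y,z
readLines(dflt)[1]   # "","CLUST","x","y","z"
```

The empty first heading in the default output is the row-number column.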
Suppose we wanted to do the same, but use the parameters from a different model that was not optimal. Mclust calculates parameters only for the optimal model by default, so to calculate parameters for a particular model (say "EEI"), you'd do:
myMclust <- Mclust(myData,modelNames="EEI")
and then proceed as before.
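As for the follow-up in the question about dumping a nested object such as the parameters list to a file, here is a rough base-R sketch of two ways to do it. The params list below is only a stand-in with invented values (its element names mean and variance$sigma are borrowed from the question), and write_nested is a hypothetical helper, not part of mclust. It assumes every level of the list is named.

```r
# Stand-in for a nested parameters object; all values are invented.
params <- list(
  mean = matrix(c(1.2, 3.4, 7.1, 1.2), nrow = 2,
                dimnames = list(c("x", "y"), c("1", "2"))),
  variance = list(
    sigma = array(diag(2), dim = c(2, 2, 2))
  )
)

# Option 1: save exactly what the console would print.
dump_file <- file.path(tempdir(), "params.txt")
capture.output(print(params), file = dump_file)

# Option 2: walk the list recursively and write each leaf under its full name.
# Hypothetical helper; assumes all list elements are named.
write_nested <- function(x, con, prefix) {
  if (is.list(x)) {
    for (name in names(x)) {
      write_nested(x[[name]], con, paste0(prefix, "$", name))
    }
  } else {
    writeLines(c(prefix, capture.output(print(x)), ""), con)
  }
}

walk_file <- file.path(tempdir(), "params_walk.txt")
con <- file(walk_file, open = "w")
write_nested(params, con, "params")
close(con)
```

Option 1 is the closest match to "print it the same way the CLI does"; option 2 is only worth it if you want control over how each element is labelled or formatted.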