Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Expectation Maximization Issue - How to find the optimum number of gaussians within the data

Plot of 2 - Dimensional data

Is there any algorithm or trick of how to determine the number of gaussians which should be identified within a set of data before applying the expectation maximization algorithm?

For example, in the above illustrated plot of 2 - Dimensional data, when I apply the Expectation Maximization algorithm, I try to fit 4 gaussians to the data and I would obtain the following result.

enter image description here

But what if I wouldn't knew the number of gaussians within the data? Is there any algorithm or trick which I could apply so that I could find out this detail?

like image 508
Simon Avatar asked Dec 27 '22 18:12

Simon


2 Answers

This might be a bit of a retread, since others already linked the wiki article of the actual cluster number determination, but I found that article a lil overly dense, so I thought I'd provide a brief, intuitive answer:

Basically, there isn't a universally 'correct' answer for the number of clusters in a data set -- the fewer clusters, the smaller the description length but the higher the variance, and in all non-trivial datasets the variance won't completely go away unless you have a Gaussian for each point, which renders the clustering useless (this is a case of the more general phenomena known as the 'futility of bias free learning': A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances).

So you basically have to pick some feature of your dataset to maximize via the number of clusters (see the wiki article on inductive bias for some example features)

In other sad news, in all such cases finding the number of clusters is known to be NP-hard, so the best you can expect is a good heuristic approach.

like image 177
zergylord Avatar answered Jan 10 '23 20:01

zergylord


Wikipedia has an article on this subject. I am not too familiar with the subject, but I've been told that clustering algorithms that don't require specifying the number of clusters instead need some density information about the clusters or some minimum distance between clusters.

like image 36
David Brown Avatar answered Jan 10 '23 20:01

David Brown