Assume a group of data points, such as the one plotted here (this graph isn't specific to my problem; it's just a suitable example): <img src="https://upload.wikimedia.org/wikipedia/commons/0/0f/Oldfaithful3.png"> Inspecting the scatter plot visually, it's fairly obvious that the data points form two 'groups', with some random points that do not obviously belong to either. I'm looking for an algorithm that would allow me to: <ul> <li>start with a data set of two or more dimensions</li> <li>detect such groups in the data set without prior knowledge of how many (if any) might be there</li> <li>once the groups have been detected, 'ask' the model of groups whether a new sample point seems to fit any of the groups</li> </ul>


Group detection in data sets

3 Answers

There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture component, I would use a probabilistic approach such as Gaussian mixture modeling, estimated either by maximum likelihood or by Bayesian methods.

Maximum likelihood estimation of mixture models is implemented in Matlab.

Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet process prior on the mixture distribution and estimate it by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over both the number of components and the component each element belongs to, which is exactly what you want. Alternatively, you could perform model selection on the number of components, but this is generally less elegant.
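To make the model-selection alternative concrete, here is a rough sketch in plain Python. It is not the Dirichlet process approach: it fits Gaussian mixtures by EM for several candidate component counts and compares them with BIC. The diagonal covariances, the farthest-first initialisation, the 5% density cutoff in `fits`, all function names, and the synthetic two-cluster data are assumptions made for illustration only.

```python
import math
import random

def component_logs(x, model):
    """log(weight_j * N(x | mean_j, diag(var_j))) for every component j."""
    weights, means, variances = model
    return [math.log(w) + sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                              for xi, m, v in zip(x, mean, var))
            for w, mean, var in zip(weights, means, variances)]

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def fit_gmm(points, k, iterations=100, min_var=1e-2):
    """EM for a k-component Gaussian mixture with diagonal covariances."""
    n, dim = len(points), len(points[0])
    # Farthest-first initialisation: spread the starting means across the data.
    means = [points[0]]
    while len(means) < k:
        means.append(max(points, key=lambda p: min(math.dist(p, m) for m in means)))
    centre = [sum(p[d] for p in points) / n for d in range(dim)]
    spread = [max(sum((p[d] - centre[d]) ** 2 for p in points) / n, min_var)
              for d in range(dim)]
    model = ([1.0 / k] * k, means, [spread[:] for _ in range(k)])
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = [posterior(p, model) for p in points]
        # M-step: re-estimate weights, means and variances from the responsibilities.
        weights, means, variances = [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-9  # guard against empty components
            mean = tuple(sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                         for d in range(dim))
            var = [max(sum(r[j] * (p[d] - mean[d]) ** 2
                           for r, p in zip(resp, points)) / nj, min_var)
                   for d in range(dim)]
            weights.append(nj / n)
            means.append(mean)
            variances.append(var)
        model = (weights, means, variances)
    return model

def posterior(x, model):
    """Probability that x belongs to each component (sums to 1)."""
    logs = component_logs(x, model)
    total = logsumexp(logs)
    return [math.exp(v - total) for v in logs]

def bic(points, model):
    """Bayesian information criterion; lower is better."""
    k, dim = len(model[0]), len(points[0])
    loglik = sum(logsumexp(component_logs(p, model)) for p in points)
    n_params = 2 * k * dim + (k - 1)  # means + variances + free weights
    return -2 * loglik + n_params * math.log(len(points))

def fits(x, model, points, quantile=0.05):
    """True if x is at least as likely as the given low quantile of the training points."""
    train = sorted(logsumexp(component_logs(p, model)) for p in points)
    return logsumexp(component_logs(x, model)) >= train[int(quantile * len(train))]

# Synthetic stand-in for the plotted data: (eruption duration, waiting time).
rng = random.Random(0)
data = ([(rng.gauss(2.0, 0.3), rng.gauss(54, 4)) for _ in range(60)] +
        [(rng.gauss(4.4, 0.4), rng.gauss(80, 4)) for _ in range(60)])

models = {k: fit_gmm(data, k) for k in range(1, 5)}
scores = {k: bic(data, m) for k, m in models.items()}
best_k = min(scores, key=scores.get)

new_point = (2.0, 54.0)
print(best_k, posterior(new_point, models[best_k]), fits(new_point, models[best_k], data))
```

Note that `posterior` always sums to 1, so it only says which group a point most resembles. Whether the point fits any group at all needs a separate check, which is what the density comparison in `fits` is for.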

There are many implementations of DP mixture models, but they may not be as convenient. For instance, here's a Matlab implementation.

Your graph suggests you are an R user. In that case, if you are looking for prepackaged solutions, the answer to your question lies in this Task View for cluster analysis.

166

answered Oct 19 '22 10:10

Tristan

I think you are looking for something along the lines of a k-means clustering algorithm.

You should be able to find adequate implementations in most general purpose languages.
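For reference, a minimal sketch of k-means (Lloyd's algorithm) in plain Python might look like the following. The function names, the small hand-made data set, and the `max_distance` cutoff used to decide whether a new point belongs to any cluster are all assumptions for illustration. Note that k-means needs the number of clusters `k` up front, and that raw Euclidean distance lets the axis with the larger scale dominate, so real data would usually be standardized first.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

def nearest_cluster(point, centroids, max_distance):
    """Index of the nearest centroid, or None if it is farther than max_distance."""
    j = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return j if math.dist(point, centroids[j]) <= max_distance else None

# Hand-made stand-in for the plotted data: (eruption duration, waiting time).
data = [(1.8, 54), (1.9, 51), (2.0, 55), (2.1, 53),
        (4.3, 80), (4.5, 82), (4.4, 79), (4.6, 83)]

centroids, labels = kmeans(data, k=2)
print(labels)
print(nearest_cluster((2.0, 53.0), centroids, max_distance=5.0))  # close to one group
print(nearest_cluster((3.2, 67.0), centroids, max_distance=5.0))  # between the groups
```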

answered Oct 19 '22 12:10

ConsultUtah

You need a clustering algorithm. They can all be divided into two groups:

1. you specify the number of groups (clusters), which is 2 in your example
2. the algorithm tries to guess the correct number of clusters by itself

If you want an algorithm of the 1st type, then K-Means is what you need.

If you want an algorithm of the 2nd type, then you probably need one of the hierarchical clustering algorithms. I have never implemented any of them, but I see an easy way to improve K-means so that it becomes unnecessary to specify the number of clusters.
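As one possible illustration of the 2nd type, here is a naive single-linkage agglomerative clustering sketch in plain Python. It is only one of several hierarchical variants, and the function names, the distance `threshold`, the minimum group size of 3, and the hand-made data are assumptions for this example. Instead of a cluster count you choose a distance cutoff: merging stops once the closest pair of clusters is farther apart than the cutoff, and clusters that stay very small can be treated as stray points.

```python
import math

def single_linkage(points, threshold):
    """Merge the two closest clusters repeatedly until the smallest gap exceeds threshold."""
    clusters = [[p] for p in points]  # every point starts as its own cluster

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        d, i, j = min((gap(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

def fits_group(point, group, threshold):
    """True if the point is within threshold of at least one member of the group."""
    return min(math.dist(point, member) for member in group) <= threshold

# Hand-made stand-in for the plotted data, plus one stray point in the middle.
data = [(1.8, 54), (1.9, 51), (2.0, 55), (2.1, 53),
        (4.3, 80), (4.5, 82), (4.4, 79), (4.6, 83),
        (3.2, 67)]

clusters = single_linkage(data, threshold=5.0)
groups = [c for c in clusters if len(c) >= 3]            # real groups
noise = [p for c in clusters if len(c) < 3 for p in c]   # leftovers
print(len(groups), noise)
print([fits_group((2.0, 52.0), g, threshold=5.0) for g in groups])
```

This naive version recomputes every pairwise gap on each merge, so it is only suitable for small data sets.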

answered Oct 19 '22 10:10

Roman


Tags:

algorithm

statistics

probability

feature-detection

Sami
