Assume a group of data points, such as the one plotted here (this graph isn't specific to my problem; it is just a suitable example):
Inspecting the scatter graph visually, it's fairly obvious the data points form two 'groups', with some random points that do not obviously belong to either.
I'm looking for an algorithm that would allow me to:
Definition. Group detection can be defined as the clustering of nodes in a graph into groups or communities. This may be a hard partitioning of the nodes, or it may allow for overlapping group memberships.
Step 1: Identify the highest and the lowest values in the given observations.
Step 2: Find the difference between the highest and the lowest value.
Step 3: Choose the number of class intervals needed (usually 5 to 20 classes are suggested, depending on the number of observations).
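The three steps above can be sketched in Python as follows. This is only a minimal illustration: the function name class_intervals is made up for this example, and taking the class width as the range divided by the number of classes is an assumption (it is the usual continuation, but the steps above stop before stating it).

```python
def class_intervals(values, num_classes):
    """Split the range of `values` into `num_classes` equal-width classes.

    Hypothetical helper for illustration; equal-width classes are an assumption.
    """
    lowest, highest = min(values), max(values)   # Step 1: extreme values
    data_range = highest - lowest                # Step 2: their difference
    if data_range == 0:
        raise ValueError("all values are identical; no intervals to build")
    width = data_range / num_classes             # Step 3: width per class (assumed)

    # Class boundaries; the last class is closed so it includes the maximum.
    edges = [lowest + i * width for i in range(num_classes + 1)]
    edges[-1] = highest

    # Count how many observations fall into each class.
    counts = [0] * num_classes
    for v in values:
        idx = min(int((v - lowest) / width), num_classes - 1)
        counts[idx] += 1
    return list(zip(edges[:-1], edges[1:])), counts


observations = [2, 4, 5, 7, 8, 10, 12, 3, 9, 11]
intervals, counts = class_intervals(observations, 5)
for (low, high), count in zip(intervals, counts):
    print(f"{low:5.1f} to {high:5.1f}: {count}")
```

The counts form a frequency distribution table for the grouped data. Note that this only bins values along one axis; it does not by itself find clusters.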
The method of identifying similar groups of data in a dataset is called clustering. It is one of the most popular techniques in data science.
Grouping data improves the accuracy and efficiency of estimation. When the collected data set is large, building a frequency distribution table for the grouped data makes it much easier to analyze.
There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture component, I would use a probabilistic approach such as Gaussian mixture modeling, estimated either by maximum likelihood or by Bayesian methods.
Maximum likelihood estimation of mixture models is implemented in Matlab.
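To make the maximum likelihood route concrete, here is a rough sketch of the EM algorithm for a Gaussian mixture, written in plain Python rather than Matlab. It is deliberately simplified and rests on several assumptions: the data is one-dimensional (the scatter plot in the question is 2-D), the number of components k is fixed in advance, and the names fit_gmm_1d and membership_probabilities are invented for this example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=100):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative sketch)."""
    data = sorted(data)
    n = len(data)
    # Deterministic start: means spread over the sorted data, shared variance.
    means = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: probability that each point came from each component.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances from those probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


def membership_probabilities(x, weights, means, variances):
    """Probability that a new point x belongs to each fitted component."""
    p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(p)
    return [pj / total for pj in p]


sample = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(sample, k=2)
print(means)
print(membership_probabilities(4.5, weights, means, variances))
```

The second function is the part that answers "which group does a new point belong to, and how sure are we?". For real 2-D data with full covariance matrices, a library implementation is a better choice than this sketch.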
Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet Process prior on the mixture distribution and estimate by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over the number of components and over the component each element belongs to, which is exactly what you want. Alternatively, you could perform model selection on the number of components, but this is generally less elegant.
There are many implementations of DP mixture models, but they may not be as convenient. For instance, here's a Matlab implementation.
Your graph suggests you are an R user. In that case, if you are looking for prepackaged solutions, the answer to your question is in this Task View for cluster analysis.
I think you are looking for something along the lines of a k-means clustering algorithm.
You should be able to find adequate implementations in most general purpose languages.
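For reference, the basic k-means loop is short enough to sketch directly. The version below uses only the Python standard library (math.dist needs Python 3.8 or later); the function name kmeans, the seeded random initialisation and the sample points are all assumptions made for illustration, and a library implementation would normally be preferable.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of coordinate tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Two caveats apply to the original question: k must be chosen up front, and every point is forced into some cluster, so the stray points that belong to neither group are not flagged as outliers.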
You need one of the clustering algorithms. All of them can be divided into 2 groups:
If you want an algorithm of the 1st type, then K-Means is what you really need.
If you want an algorithm of the 2nd type, then you probably need one of the hierarchical clustering algorithms. I have never implemented any of them, but I see an easy way to improve K-means so that it becomes unnecessary to specify the number of clusters.