Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group detection in data sets

Assume a group of data points, such as one plotted here (this graph isn't specific to my problem, but just used as a suitable example):

Inspecting the scatter graph visually, it's fairly obvious the data points form two 'groups', with some random points that do not obviously belong to either.

I'm looking for an algorithm, that would allow me to:

  • start with a data set of two or more dimensions.
  • detect such groups from the dataset without prior knowledge on how many (or if any) might be there
  • once the groups have been detected, 'ask' the model of groups, if a new sample point seems to fit to any of the groups
like image 888
Sami Avatar asked Jan 12 '10 20:01

Sami


People also ask

What is group detection?

Definition. Group detection can defined as the clustering of nodes in a graph into groups or communities. This may be a hard partitioning of the nodes, or may allow for overlapping group memberships.

How do you identify a group of data?

Step 1: Identify the highest and the lowest (least) data values in the given observations. Step 2: Find the difference between the highest and least value. Step 3: Now, assume the number of class intervals we need (usually 5 to 20 classes are suggested to take based the number of observations).

Is a method to discover grouping in a dataset?

The method of identifying similar groups of data in a dataset is called clustering. It is one of the most popular techniques in data science.

Why is group by used in data analysis?

Grouping of data improves the accuracy/efficiency of estimation. To analyze the frequency distribution table for grouped data when the collected data is large, then we can follow this approach to analyze it easily.


3 Answers

There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture, I would use a probabilistic approach such as Gaussian mixture modeling either estimated by maximum likelihood or Bayes.

Maximum likelihood estimation of mixtures models is implemented in Matlab.

Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet Process prior on the mixture distribution and estimate by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over the number of components and the components each elements belong to, which is exactly what you want. Alternatively you could perform model selection on the number of components, but this is generally less elegant.

There are many implementation of DP mixture models models, but they may not be as convenient. For instance, here's a Matlab implementation.

Your graph suggests you are an R user. In that case, if you are looking for prepacked solutions, the answer to your question lies on this Task View for cluster analysis.

like image 166
Tristan Avatar answered Oct 19 '22 10:10

Tristan


I think you are looking for something along the lines of a k-means clustering algorithm.

You should be able to find adequate implementations in most general purpose languages.

like image 44
ConsultUtah Avatar answered Oct 19 '22 12:10

ConsultUtah


You need one of clustering algorithms. All of them can be devided in 2 groups:

  1. you specify number of groups (clusters) - 2 clusters in your example
  2. algorithm try to guess correct number of clusters by itself

If you want algorithm of 1st type then K-Means is what you really need.

If you want algorithm of 2nd type then you probably need one of hierarchical clustering algorithms. I haven't ever implement any of them. But I see an easy way to improve K-means in such way thay it will be unnecessary to specify number of clusters.

like image 42
Roman Avatar answered Oct 19 '22 10:10

Roman