Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In search of algorithm to group similar data

Tags:

algorithm

php

A simple question, but an answer that has been tormenting me for days...

I have as input an array (php) of 2 aliases, let's say:

Array(
  Array(1,5),
  Array(6,8),
  Array(6,1),
  Array(9,3),
)

Each of those state "1" is the same as "5", "6" is the same as "8",... Simple, now I need to group those, looking at the example above, the algorithm should give me, if I ask nicely, two groups:

Array(1,5,6,8) and Array(9,3)

Simple commutation logic, but I cannot find a way of tackling it with code! Any guideline would be much appreciated!!

like image 371
DJules Avatar asked May 23 '11 19:05

DJules


People also ask

What is clustering algorithm used for?

It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior. There are many clustering algorithms to choose from and no single best clustering algorithm for all cases.

How is clustering used in search engines?

Clusters created around each of these “concepts” associated with the term could lead the search engine to show a certain percentage of results covering the different concepts, and ensure diversity in those results. Comparing the relationship between words and concepts on a web page and in an advertisement.

What are the common clustering algorithms?

K-means clustering is the most commonly used clustering algorithm. It's a centroid-based algorithm and the simplest unsupervised learning algorithm. This algorithm tries to minimize the variance of data points within a cluster. It's also how most people are introduced to unsupervised machine learning.

What is clustering algorithm in data mining?

Cluster Analysis in Data Mining means that to find out the group of objects which are similar to each other in the group but are different from the object in other groups. In the process of clustering in data analytics, the sets of data are divided into groups or classes based on data similarity.


1 Answers

You can do this insanely fast using the union-find algorithm.

The idea is to have a forest of trees, where each tree represents elements "that are equal". You represent this tree as a simple array, where a[i] either can be the parent of i, -1 if i is the root, or say -2 if i is not yet in a tree.

In your case you would start with the simple tree:

1   
 \  
  5 

The next thing you want it to join 6 and 8. However, they are both unassigned, so you you add a new tree. (That is a[6]=-1, a[8]=6):

1    6   
 \    \  
  5    8 

Now you want to join 6 and 1. You find out which sets they belong to, by following their parents to the top. Coincidentally they are both roots. In this case we make the smallest tree the child of the other tree. (a[6]=1)

  1  
 / \ 
6   5
 \
  8

Finally we want to join 9 and 3, they are both unassigned, so we create a new tree. (a[3]=-1, a[9]=3)

  1    9
 / \    \
6   5    3
 \
  8

Say you also have 5=3? Follow their parents till you reach the roots. You find that they are not already equal, so you merge the trees. Since the 9 controls a less high tree, we add it to one's tree, to get:

  .1.
 / | \
6  9  5
 \  \
  8  3

As you can see it is now easy to check whether two elements are in the same set. Say you wanna test whether 8 and 3 are equal? Just follow their paths upwards, and you will see that 8 is in the set represented by one, and three in the set represented by 9. Hence they are unequal.

Using the heuristic of always putting the smaller tree under the bigger, gives you an log(n) average find or merge time. The Wikipedia article explains an extra trick, that will make it basically constant time.

like image 137
Thomas Ahle Avatar answered Oct 05 '22 15:10

Thomas Ahle