Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partition a list of sets by shared elements

Tags:

algorithm

sql

set

Here's the jist of the problem: Given a list of sets, such as:

[ (1,2,3), (5,2,6), (7,8,9), (6,12,13), (21,8,34), (19,20) ]

Return a list of groups of the sets, such that sets that have a shared element are in the same group.

[ [ (1,2,3), (5,2,6), (6,12,13) ], [ (7,8,9), (21,8,34) ], [ (19,20) ] ]

Note the stickeyness - the set (6,12,13) doesn't have a shared element with (1,2,3), but they get put in the same group because of (5,2,6).

To complicate matters, I should mention that I don't really have these neat sets, but rather a DB table with several million rows that looks like:

element | set_id
----------------
1       | 1
2       | 1
3       | 1
5       | 2
2       | 2
6       | 2

and so on. So I would love a way to do it in SQL, but I would be happy with a general direction for the solution.

EDIT: Changed the table column names to (element, set_id) instead of (key, group_id), to make the terms more consistent. Note that Kev's answer uses the old column names.

like image 656
itsadok Avatar asked Oct 07 '08 15:10

itsadok


2 Answers

The problem is exactly the computation of the connected components of an hypergraph: the integers are the vertices, and the sets are the hyperedges. A usual way of computing the connected components is by flooding them one after the other:

  • for all i = 1 to N, do:
  • if i has been tagged by some j < i, then continue (I mean skip to the next i)
  • else flood_from(i,i)

where flood_from(i,j) would be defined as

  • for each set S containing i, if it is not already tagged by j then:
  • tag S by j and for each element k of S, if k is not already tagged by j, then tag it by j, and call flood_from(k,j)

The tags of the sets then give you the connected components you are looking for.


In terms of databases, the algorithm can be expressed as follows: you add a TAG column to your database, and you compute the connected component of set i by doing

  • S = select all rows where set_id == i
  • set TAG to i for the rows in S
  • S' = select all rows where TAG is not set and where element is in element(S)
  • while S' is not empty, do
  • ---- set TAG to i for the rows in S'
  • ---- S'' = select all rows where TAG is not set and where element is in element(S')
  • ---- S = S union S'
  • ---- S' = S''
  • return set_id(S)

Another (theoretical) way of presenting this algorithm would be to say that you are looking for the fixed points of a mapping:

  • if A = {A1, ..., An} is a set of sets, define union(A) = A1 union ... union An
  • if K = {k1, ..., kp} is a set of integers, define incidences(K) = the set of sets which intersect K

Then if S is a set, the connected component of S is obtained by iterating (incidences)o(union) on S until a fixed point is reached:

  1. K = S
  2. K' = incidences(union(K)).
  3. if K == K', then return K, else K = K' and go to 2.
like image 169
Camille Avatar answered Sep 27 '22 23:09

Camille


You could think of it as a graph problem where the set (1,2,3) is connected to the set (5,2,6) via the 2. And then use a standard algorithm to fine the connected sub-graphs.

Here's a quick python implementation:

nodes = [ [1,2,3], [2,4,5], [6,7,8], [10,11,12], [7,10,13], [12], [] ]
links = [ set() for x in nodes ]

#first find the links
for n in range(len(nodes)):
    for item in nodes[n]:
        for m in range(n+1, len(nodes)):
            if (item in nodes[m]):
                links[n].add(m)
                links[m].add(n)

sets = []
nodes_not_in_a_set = range(len(nodes))

while len(nodes_not_in_a_set) > 0:
    nodes_to_explore = [nodes_not_in_a_set.pop()]
    current_set = set()
    while len(nodes_to_explore) > 0:
        current_node = nodes_to_explore.pop()
        current_set.add(current_node)
        if current_node in nodes_not_in_a_set:
            nodes_not_in_a_set.remove(current_node)
        for l in links[current_node]:
            if l not in current_set and l not in nodes_to_explore:
                nodes_to_explore.append(l)
    if len(current_set) > 0:
        sets.append(current_set)

for s in sets:
    print [nodes[n] for n in s]

output:

[[]]
[[6, 7, 8], [10, 11, 12], [7, 10, 13], [12]]
[[1, 2, 3], [2, 4, 5]]
like image 44
Matt Price Avatar answered Sep 27 '22 22:09

Matt Price