Here's the jist of the problem: Given a list of sets, such as:
[ (1,2,3), (5,2,6), (7,8,9), (6,12,13), (21,8,34), (19,20) ]
Return a list of groups of the sets, such that sets that have a shared element are in the same group.
[ [ (1,2,3), (5,2,6), (6,12,13) ], [ (7,8,9), (21,8,34) ], [ (19,20) ] ]
Note the stickeyness - the set (6,12,13) doesn't have a shared element with (1,2,3), but they get put in the same group because of (5,2,6).
To complicate matters, I should mention that I don't really have these neat sets, but rather a DB table with several million rows that looks like:
element | set_id
----------------
1 | 1
2 | 1
3 | 1
5 | 2
2 | 2
6 | 2
and so on. So I would love a way to do it in SQL, but I would be happy with a general direction for the solution.
EDIT: Changed the table column names to (element, set_id) instead of (key, group_id), to make the terms more consistent. Note that Kev's answer uses the old column names.
The problem is exactly the computation of the connected components of an hypergraph: the integers are the vertices, and the sets are the hyperedges. A usual way of computing the connected components is by flooding them one after the other:
where flood_from(i,j) would be defined as
The tags of the sets then give you the connected components you are looking for.
In terms of databases, the algorithm can be expressed as follows: you add a TAG column to your database, and you compute the connected component of set i by doing
Another (theoretical) way of presenting this algorithm would be to say that you are looking for the fixed points of a mapping:
Then if S is a set, the connected component of S is obtained by iterating (incidences)o(union) on S until a fixed point is reached:
You could think of it as a graph problem where the set (1,2,3) is connected to the set (5,2,6) via the 2. And then use a standard algorithm to fine the connected sub-graphs.
Here's a quick python implementation:
nodes = [ [1,2,3], [2,4,5], [6,7,8], [10,11,12], [7,10,13], [12], [] ]
links = [ set() for x in nodes ]
#first find the links
for n in range(len(nodes)):
for item in nodes[n]:
for m in range(n+1, len(nodes)):
if (item in nodes[m]):
links[n].add(m)
links[m].add(n)
sets = []
nodes_not_in_a_set = range(len(nodes))
while len(nodes_not_in_a_set) > 0:
nodes_to_explore = [nodes_not_in_a_set.pop()]
current_set = set()
while len(nodes_to_explore) > 0:
current_node = nodes_to_explore.pop()
current_set.add(current_node)
if current_node in nodes_not_in_a_set:
nodes_not_in_a_set.remove(current_node)
for l in links[current_node]:
if l not in current_set and l not in nodes_to_explore:
nodes_to_explore.append(l)
if len(current_set) > 0:
sets.append(current_set)
for s in sets:
print [nodes[n] for n in s]
output:
[[]]
[[6, 7, 8], [10, 11, 12], [7, 10, 13], [12]]
[[1, 2, 3], [2, 4, 5]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With