I have thousands of lines of 1 to 100 numbers, every line define a group of numbers and a relationship among them. I need to get the sets of related numbers.
Little Example: If I have this 7 lines of data
T1 T2
T3
T4
T5
T6 T1
T5 T4
T3 T4 T7
I need a not so slow algorithm to know that the sets here are:
T1 T2 T6 (because T1 is related with T2 in the first line and T1 related with T6 in the line 5)
T3 T4 T5 T7 (because T5 is with T4 in line 6 and T3 is with T4 and T7 in line 7)
but when you have very big sets is painfully slow to do a search of a T(x) in every big set, and do unions of sets... etc.
Do you have a hint to do this in a not so brute force manner?
I'm trying to do this in Python.
Once you have built the data structure, exactly what queries do you want to run against it? Show us your existing code. What is a T(x)? You talk about "groups of numbers" but your sample data shows T1, T2, etc; please explain.
Have you read this: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Try looking at this Python implementation: http://code.activestate.com/recipes/215912-union-find-data-structure/
OR you can lash up something rather simple and understandable yourself, e.g.
[Update: totally new code]
class DisjointSet(object):
def __init__(self):
self.leader = {} # maps a member to the group's leader
self.group = {} # maps a group leader to the group (which is a set)
def add(self, a, b):
leadera = self.leader.get(a)
leaderb = self.leader.get(b)
if leadera is not None:
if leaderb is not None:
if leadera == leaderb: return # nothing to do
groupa = self.group[leadera]
groupb = self.group[leaderb]
if len(groupa) < len(groupb):
a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
groupa |= groupb
del self.group[leaderb]
for k in groupb:
self.leader[k] = leadera
else:
self.group[leadera].add(b)
self.leader[b] = leadera
else:
if leaderb is not None:
self.group[leaderb].add(a)
self.leader[a] = leaderb
else:
self.leader[a] = self.leader[b] = a
self.group[a] = set([a, b])
data = """T1 T2
T3 T4
T5 T1
T3 T6
T7 T8
T3 T7
T9 TA
T1 T9"""
# data is chosen to demonstrate each of 5 paths in the code
from pprint import pprint as pp
ds = DisjointSet()
for line in data.splitlines():
x, y = line.split()
ds.add(x, y)
print
print x, y
pp(ds.leader)
pp(ds.group)
and here is the output from the last step:
T1 T9
{'T1': 'T1',
'T2': 'T1',
'T3': 'T3',
'T4': 'T3',
'T5': 'T1',
'T6': 'T3',
'T7': 'T3',
'T8': 'T3',
'T9': 'T1',
'TA': 'T1'}
{'T1': set(['T1', 'T2', 'T5', 'T9', 'TA']),
'T3': set(['T3', 'T4', 'T6', 'T7', 'T8'])}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With