I have some data that looks something like this:
ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6
...
...
where each row is a group.
My goal is to have a dictionary for each ID, followed by a set of the other IDs that share >= 1 group with it.
For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }
I can think of 3 options for this, and I'm wondering which is (generally) best:
The set is far faster, in general. Testing for membership in a list is O(n), linear in the size of the list. Adding to a set is O(1), independent of the number of the items in the list.
Lists are slightly faster than sets when you just want to iterate over the values. Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.
1. Does converting an object to a set maintain the object's order? No. A set is not an ordered data structure, so order is not maintained.
Since the set has been converted to a list, the elements are now ordered. That means we can use the get() method to access elements by index.
Option 2 sounds the most logical to me, especially with a defaultdict it should be fairly easy to do :)
import pprint
import collections
data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''
groups = collections.defaultdict(set)
for row in data.split('\n'):
cols = row.split()
for groupcol in cols:
for col in cols:
if col is not groupcol:
groups[groupcol].add(col)
pprint.pprint(dict(groups))
Results:
{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
'ID2': set(['ID1', 'ID3']),
'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
'ID4': set(['ID1', 'ID5']),
'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
'ID6': set(['ID3', 'ID5', 'ID7']),
'ID7': set(['ID3', 'ID5', 'ID6'])}
TL;DR: Go with option 2. Just use sets from the start.
In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1)
for both, but checking if an element exists is O(n)
for the list and O(1)
for the set.
So option 1 is immediately out. If you are inserting n
items and need to check the list every time, then the overall complexity becomes O(n^2)
.
Options 2 and 3 are both optimal at O(n)
overall. Option 2 might be faster in micro-benchnarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With