Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better to add item to a set, or convert final list to a set?

Tags:

python

loops

set

I have some data that looks something like this:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...  

where each row is a group.

My goal is to have a dictionary for each ID, followed by a set of the other IDs that share >= 1 group with it.

For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

I can think of 3 options for this, and I'm wondering which is (generally) best:

  1. Check whether an ID is already in the list before adding it
  2. Create sets instead of lists, and add each ID to the set
  3. Add all IDs to the list, then convert all of the lists to sets at the end.
like image 940
Jeremy Avatar asked Sep 15 '13 00:09

Jeremy


People also ask

Is set add faster than list append?

The set is far faster, in general. Testing for membership in a list is O(n), linear in the size of the list. Adding to a set is O(1), independent of the number of the items in the list.

What is more efficient set or list?

Lists are slightly faster than sets when you just want to iterate over the values. Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.

Does converting an object to a set maintain the object order?

1. Does converting an object to a set maintain the object's order? No. A set is not an ordered data structure, so order is not maintained.

What happens when you convert a set to a list?

Since the set has been converted to a list, the elements are now ordered. That means we can use the get() method to access elements by index.


2 Answers

Option 2 sounds the most logical to me, especially with a defaultdict it should be fairly easy to do :)

import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)

for row in data.split('\n'):
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            if col is not groupcol:
                groups[groupcol].add(col)

pprint.pprint(dict(groups))

Results:

{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}
like image 20
Wolph Avatar answered Oct 31 '22 00:10

Wolph


TL;DR: Go with option 2. Just use sets from the start.

In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1) for both, but checking if an element exists is O(n) for the list and O(1) for the set.

So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchnarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.

like image 166
cbarrick Avatar answered Oct 31 '22 02:10

cbarrick