Better to add item to a set, or convert final list to a set?

Tags: python, loops, set

I have some data that looks something like this:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...

where each row is a group.

My goal is to build a dictionary that maps each ID to the set of other IDs that share at least one group with it.

For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

I can think of 3 options for this, and I'm wondering which is (generally) best:

1. Check whether an ID is already in the list before adding it
2. Create sets instead of lists, and add each ID to the set
3. Add all IDs to the list, then convert all of the lists to sets at the end

940

asked Sep 15 '13 00:09

Jeremy

2 Answers

Option 2 sounds the most logical to me. With a defaultdict it should be fairly easy to do :)

import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)

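for row in data.split('\n'):
    cols = row.split()
    # Pair every ID in the row with every other ID in the same row
    for groupcol in cols:
        for col in cols:
            # Compare by value (!=), not identity (is not), so an ID
            # never ends up in its own set
            if col != groupcol:
                groups[groupcol].add(col)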

pprint.pprint(dict(groups))

Results:

{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}

answered Oct 31 '22 00:10

Wolph

TL;DR: Go with option 2. Just use sets from the start.

In Python, sets are hash sets and lists are dynamic arrays. Appending to a list is amortized O(1) and adding to a set is O(1) on average, but checking whether an element exists is O(n) for the list and O(1) on average for the set.
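A rough way to see the difference in membership cost is to time `in` against a list and a set holding the same items. This is only a sketch: the size `n`, the probe value, and the repeat count are arbitrary choices made for illustration, and the absolute numbers will vary by machine.

```python
import timeit

n = 100_000                 # arbitrary size, chosen for illustration
as_list = list(range(n))
as_set = set(as_list)
probe = n - 1               # last element: the list has to scan everything

# Time 200 membership tests against each container
list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

The list timing should grow roughly in proportion to `n`, while the set timing should stay about the same.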

So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchmarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.
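For comparison, here is a minimal sketch of what the three options from the question might look like side by side. The function names `option1`, `option2`, `option3` and the `rows` list are made up for this example, and the sample rows are the ones from the question. All three build the same mapping; they differ only in how duplicates are avoided.

```python
from collections import defaultdict

# Sample rows from the question, already split into IDs
rows = [
    ["ID1", "ID2", "ID3"],
    ["ID1", "ID4", "ID5"],
    ["ID3", "ID5", "ID7", "ID6"],
]

def option1(rows):
    """Lists, with a membership check before every append."""
    groups = defaultdict(list)
    for row in rows:
        for key in row:
            for other in row:
                # `not in` scans the whole list each time
                if other != key and other not in groups[key]:
                    groups[key].append(other)
    # Converted to sets here only so the results can be compared
    return {k: set(v) for k, v in groups.items()}

def option2(rows):
    """Sets from the start; duplicates are dropped on insert."""
    groups = defaultdict(set)
    for row in rows:
        for key in row:
            groups[key].update(x for x in row if x != key)
    return dict(groups)

def option3(rows):
    """Lists that may hold duplicates, converted to sets at the end."""
    groups = defaultdict(list)
    for row in rows:
        for key in row:
            groups[key].extend(x for x in row if x != key)
    return {k: set(v) for k, v in groups.items()}

assert option1(rows) == option2(rows) == option3(rows)
print(sorted(option2(rows)["ID1"]))  # ['ID2', 'ID3', 'ID4', 'ID5']
```

Option 3 temporarily stores every duplicate, so it can use more memory than option 2 when the same pairs show up in many groups.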

166

answered Oct 31 '22 02:10

cbarrick
