
Optimization Break-even Point: iterate many times over set or convert to list first?

Here's something I've always wondered about. I'll pose the question for Python, but I would also welcome answers which address the standard libraries in Java and C++.

Let's say you have a Python list called "my_list", and you would like to iterate over its unique elements. There are two natural approaches:

#iterate over set
for x in set(my_list):
    do_something(x)

or

#list to set to list
for x in list(set(my_list)):
    do_something(x)

The tension is that iterating over a list is faster than iterating over a set, but it takes time to convert a set to a list. My guess is that the answer to this question will depend on a number of factors, such as:

  • How many times will we need to iterate?
  • How big is the original list?
  • How many repetitions in the original list should we expect?

So I guess I'm looking for a rule of thumb of the form "If the list has x many elements with each element repeated no more than y times and you only need to iterate z times then you should iterate over the set; otherwise you should convert it to a list."

asked Sep 18 '15 by Paul Siegel

1 Answer

I'm looking for a rule of thumb...

Rule of thumb

Here's the best rule of thumb for writing optimal Python: use as few intermediary steps as possible and avoid materializing unnecessary data structures.

Applied to this question: sets are iterable. Don't convert them to another data structure just to iterate over them. Trust Python to know the fastest way to iterate over sets. If it were faster to convert them to lists, Python would do that.

On optimization:

Don't attempt to prematurely optimize by adding complexity to your program. If your program takes too long, profile it, then optimize the bottlenecks. If you're using Python, you're probably more concerned about development time than exactly how long your program takes to run.
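
To make "profile it" concrete: the standard library ships cProfile. A minimal sketch of how you'd use it (my_program is a placeholder name, not anything from the question):

import cProfile

def my_program():
    # stand-in for the code you suspect is slow
    return sum(x * x for x in range(100000))

# Sort by cumulative time to see which calls dominate:
cProfile.run('my_program()', sort='cumulative')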

Demonstration

In Python 2.7:

import collections
import timeit

# deque(maxlen=0).extend consumes an iterable at C speed while storing
# nothing, so it measures pure iteration cost:
blackhole = collections.deque(maxlen=0).extend
s = set(xrange(10000))

We see that for larger n, simpler is better:

>>> timeit.timeit(lambda: blackhole(s))
108.87403416633606
>>> timeit.timeit(lambda: blackhole(list(s)))
189.0135440826416

And for smaller n the same relationship holds:

>>> s = set(xrange(10))
>>> timeit.timeit(lambda: blackhole(s))
0.2969839572906494
>>> timeit.timeit(lambda: blackhole(list(s)))
0.630713939666748
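
The demonstration above is Python 2.7. If you want to reproduce it on Python 3, the only change needed is range in place of xrange; a minimal sketch follows, with a smaller repetition count so it finishes quickly (absolute timings will of course differ by machine and version):

import collections
import timeit

blackhole = collections.deque(maxlen=0).extend
s = set(range(10000))

print(timeit.timeit(lambda: blackhole(s), number=10000))
print(timeit.timeit(lambda: blackhole(list(s)), number=10000))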

Yes, lists iterate faster than sets (try it on your own Python interpreter):

l = list(s)
timeit.repeat(lambda: blackhole(l))  # returns a list of timings; compare its min

But that doesn't mean you should convert sets to lists just for iteration.

Break-even Analysis

OK, so you've profiled your code and found that you're iterating over a set a lot (and I presume the set is static, or else what we're doing here is very problematic). I hope you're familiar with the set methods, and aren't replicating that functionality by hand. (I also think you should consider pairing frozensets with tuples, because using a list (mutable) as a stand-in for a canonical set (also mutable) seems like it could be error-prone.) So with that caveat, let's do an analysis.
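
For instance, membership tests and set algebra are already built in, so there's no need to re-implement them with loops (a small illustration, not code from the question):

a = set('abcd')
b = set('cdef')
a & b       # intersection: {'c', 'd'}
a - b       # difference: {'a', 'b'}
a | b       # union
'a' in a    # O(1) membership test, no loop required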

It may be that by investing in complexity, and taking on a greater risk of errors from more lines of code, you can get a good payoff. This analysis demonstrates the break-even point. I do not know how much extra performance would justify the greater risk and development time, but it will tell you at what point you can start to pay toward those:

import collections
import timeit
import pandas as pd

# Discards everything it's fed - measures pure iteration cost:
BLACKHOLE = collections.deque(maxlen=0).extend

SET = set(range(1000000))

def iterate(n, iterable):
    # iterate over the set directly, n times
    for _ in range(n):
        BLACKHOLE(iterable)

def list_iterate(n, iterable):
    # pay the conversion cost once, then iterate over the list n times
    l = list(iterable)
    for _ in range(n):
        BLACKHOLE(l)

columns = ('ConvertList', 'n', 'seconds')

def main():
    results = {c: [] for c in columns}

    for n in range(21):
        for fn in (iterate, list_iterate):
            seconds = min(timeit.repeat((lambda: fn(n, SET)), number=10))
            results['ConvertList'].append(fn.__name__)
            results['n'].append(n)
            results['seconds'].append(seconds)

    df = pd.DataFrame(results)
    df2 = df.pivot(index='n', columns='ConvertList')  # one column per strategy
    df2.plot()
    import pylab
    pylab.show()

if __name__ == '__main__':
    main()

And it looks like your break-even point is at 5 complete iterations. At fewer than 5, it could not possibly make sense to convert. Only at 5 or more do you begin to compensate for the additional development time, complexity (increasing maintenance costs), and risk from the greater number of lines of code.

I think you'd have to be doing it quite a lot to be worth tacking on the added complexity and lines of code to your project.
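
If you'd rather compute the break-even point than read it off the plot, here's a back-of-the-envelope sketch (my own addition, not part of the analysis above; it assumes per-pass costs are roughly constant and that the set iterates slower than the list):

import collections
import timeit

BLACKHOLE = collections.deque(maxlen=0).extend
SET = set(range(1000000))
LIST = list(SET)

# All three measured with the same number= so the units cancel:
convert_cost = min(timeit.repeat(lambda: list(SET), number=10))
set_pass = min(timeit.repeat(lambda: BLACKHOLE(SET), number=10))
list_pass = min(timeit.repeat(lambda: BLACKHOLE(LIST), number=10))

# Converting pays off once n * (set_pass - list_pass) > convert_cost:
print('break-even at ~%.1f iterations' % (convert_cost / (set_pass - list_pass)))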

[Figure: seconds vs. number of iterations n for iterate and list_iterate; the two curves cross near n = 5]

These results were created with Anaconda's Python 2.7, from an Ubuntu 14.04 terminal. You may get varying results with different implementations and versions.


Concerns

What concerns me is that both sets and lists are mutable. A set will prevent you from modifying it while iterating over it, but a list created from that set will not:

>>> s = set('abc')
>>> for e in s:
...     s.add(e + e.upper())
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Set changed size during iteration

If you modify the set while iterating over a list derived from it, you won't get an error telling you that you did:

>>> for e in list(s):
...     s.add(e + e.upper())

That's why I also suggested using frozensets and tuples instead. It acts as a built-in guard against semantically incorrect alteration of your data:

>>> s = frozenset('abc')
>>> t_s = tuple(s)
>>> for e in t_s:
...     s.add(e + e.upper())
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AttributeError: 'frozenset' object has no attribute 'add'

In the end, you have to trust yourself to get your algorithm correct. Frequently, I'm told I gave good advice when I warn newer Python users about these sorts of things: they learn it was good advice because they didn't listen at first, and found that ignoring it created unnecessary complexity, complications, and resulting problems. But there are also things like logical correctness that you'll only have yourself to blame for if you don't get them right. Minimizing the things that can go wrong is a benefit that is usually worth the performance tradeoff. And again, if performance (and not correctness or speed of development) were the prime concern while tackling this project, you wouldn't be using Python.

answered Oct 15 '22 by Russia Must Remove Putin