Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed difference between iterating over generators and lists

In the following trivial examples there are two functions that sort a list of random numbers. The first method passes sorted a generator expression, the second method creates a list first:

import random
l = [int(1000*random.random()) for i in xrange(10*6)]

def sort_with_generator():
    return sorted(a for a in l)

def sort_with_list():
    return sorted([a for a in l])

Benchmarking with line profiler indicates that the second option (sort_with_list) is about twice as fast as the generator expression.

Can anyone explain what's happening, and why the first method is so much slower than the second?

like image 644
Hamish Avatar asked Feb 16 '23 08:02

Hamish


2 Answers

Your first example is a generator expression that iterates over a list. Your second example is a list expression that iterates over a list. Indeed, the second example is slightly faster.

>>> import timeit
>>> timeit("sorted(a for a in l)", setup="import random;l = [int(1000*random.random()) for i in xrange(10*6)]")
5.963912010192871
>>> timeit("sorted([a for a in l])", setup="import random;l = [int(1000*random.random()) for i in xrange(10*6)]")
5.021576881408691

The reason for this is undoubtedly that making a list is done in one go, while iterating over a generator requires function calls.

Generators are not to speed up small lists like this (you have 60 elements in the list, that's very small). It's to save memory when creating long lists, primarily.

like image 54
Lennart Regebro Avatar answered Mar 07 '23 08:03

Lennart Regebro


If you look at the source for sorted, any sequence you pass in gets copied into a new list first.

newlist = PySequence_List(seq);

generator --> list appears to be slower than list --> list.

>>> timeit.timeit('x = list(l)', setup = 'l = xrange(1000)')
16.656711101531982
>>> timeit.timeit('x = list(l)', setup = 'l = range(1000)')
4.525658845901489

As to why a copy must be made, think about how sorting works. Sorts aren't linear algorithms. We move through the data multiple times, sometimes traversing data in both directions. A generator is intended for producing a sequence through which we iterate once and only once, from start to somewhere after it. A list allows for random access.

On the other hand, creating a list from a generator will mean only one list in memory, while making a copy of a list will mean two lists in memory. Good 'ol fashioned space-time tradeoff.

Python uses Timsort, a hybrid of merge sort and insertion sort.

like image 28
Brian Avatar answered Mar 07 '23 06:03

Brian