In the following trivial examples there are two functions that sort a list of random numbers. The first method passes sorted a generator expression, the second method creates a list first:
import random
l = [int(1000*random.random()) for i in xrange(10*6)]
def sort_with_generator():
    return sorted(a for a in l)
def sort_with_list():
    return sorted([a for a in l])
Benchmarking with line profiler indicates that the second option (sort_with_list) is about twice as fast as the generator expression.
Can anyone explain what's happening, and why the first method is so much slower than the second?
Your first example is a generator expression that iterates over a list. Your second example is a list expression that iterates over a list. Indeed, the second example is slightly faster.
>>> import timeit
>>> timeit("sorted(a for a in l)", setup="import random;l = [int(1000*random.random()) for i in xrange(10*6)]")
5.963912010192871
>>> timeit("sorted([a for a in l])", setup="import random;l = [int(1000*random.random()) for i in xrange(10*6)]")
5.021576881408691
The reason for this is undoubtedly that making a list is done in one go, while iterating over a generator requires function calls.
Generators are not to speed up small lists like this (you have 60 elements in the list, that's very small). It's to save memory when creating long lists, primarily.
If you look at the source for sorted, any sequence you pass in gets copied into a new list first.
newlist = PySequence_List(seq);
generator --> list appears to be slower than list --> list.
>>> timeit.timeit('x = list(l)', setup = 'l = xrange(1000)')
16.656711101531982
>>> timeit.timeit('x = list(l)', setup = 'l = range(1000)')
4.525658845901489
As to why a copy must be made, think about how sorting works. Sorts aren't linear algorithms. We move through the data multiple times, sometimes traversing data in both directions. A generator is intended for producing a sequence through which we iterate once and only once, from start to somewhere after it. A list allows for random access.
On the other hand, creating a list from a generator will mean only one list in memory, while making a copy of a list will mean two lists in memory. Good 'ol fashioned space-time tradeoff.
Python uses Timsort, a hybrid of merge sort and insertion sort.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With