I have a reasonably large number n = 10000 of sorted lists of length k = 100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge the sorted lists with heapq.merge() in a tree of depth log(n), for O(nk log(n)) total time, than to sort the entire thing at once with sorted() in O(nk log(nk)) time.
However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() that much faster than heapq.merge() that it outstrips the asymptotic time advantage of the classic merge?
import itertools
import heapq

data = [range(n*8000, n*8000+10000, 100) for n in range(10000)]

# Approach 1
for val in heapq.merge(*data):
    test = val

# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
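For concreteness, the tree-of-merges idea described above can be sketched like this (a hypothetical tree_merge helper for illustration, not code the question actually ran):

```python
import heapq

def tree_merge(lists):
    """Merge sorted lists pairwise in a tree of depth ~log2(len(lists))."""
    lists = [list(lst) for lst in lists]
    while len(lists) > 1:
        merged = []
        # Merge adjacent pairs; an odd leftover list passes through unchanged.
        for i in range(0, len(lists) - 1, 2):
            merged.append(list(heapq.merge(lists[i], lists[i + 1])))
        if len(lists) % 2:
            merged.append(lists[-1])
        lists = merged
    return lists[0] if lists else []

# Small demo with 4 sorted lists.
print(tree_merge([[1, 4], [2, 5], [0, 3], [6, 7]]))
```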
CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.
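One way to see that linear behavior without timing noise (a sketch in Python 3, where the equivalent input is list(range(N)) * 2) is to count comparisons and check that they roughly double when N doubles, instead of growing like N log N:

```python
class Counted:
    """Wrap a value and count how often __lt__ is called."""
    calls = 0
    def __init__(self, val):
        self.val = val
    def __lt__(self, other):
        Counted.calls += 1
        return self.val < other.val

def cmp_count(n):
    # Two natural ascending runs of length n: the shape of range(N)*2.
    Counted.calls = 0
    sorted(Counted(i) for i in list(range(n)) * 2)
    return Counted.calls

small, big = cmp_count(5000), cmp_count(10000)
print(small, big)  # comparisons roughly double when the input doubles
```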
So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)
(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)
As discussed a bit in comments, list.sort() plays lots of engineering tricks that may cut the number of comparisons needed over what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):
class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__
sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.
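The sort() half of that claim is easy to verify directly (a quick Python 3 sketch, with an illustrative LtOnly class): a class whose only working rich comparison is __lt__ still sorts fine, because the sort would raise immediately if it ever tried anything else.

```python
class LtOnly:
    """Only __lt__ works; every other rich comparison raises."""
    def __init__(self, val):
        self.val = val
    def __lt__(self, other):
        return self.val < other.val
    def _boom(self, other):
        raise TypeError("sorted() should never need this")
    __le__ = __ge__ = __gt__ = __eq__ = _boom

vals = [LtOnly(v) for v in (3, 1, 4, 1, 5, 9, 2, 6)]
print([x.val for x in sorted(vals)])  # [1, 1, 2, 3, 4, 5, 6, 9]
```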
Then changing your data to use instances of that class:

data = [[V(i) for i in range(n*8000, n*8000+10000, 100)]
        for n in range(10000)]
Then run both methods:

ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))
The output is kinda remarkable:

43,207,638
1,639,884

So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.
Those comparison counts looked too remarkable to me ;-) The count for heapq.merge() looked about twice as large as I thought reasonable.
Took a while to track this down. In short, it's an artifact of the way heapq.merge() is implemented: it maintains a heap of 3-element list objects, each containing the current next value from an iterable, the 0-based index of that iterable among all the iterables (to break comparison ties), and that iterable's __next__ method. The heapq functions all compare these little lists (instead of just the iterables' values), and list comparison always goes through the lists first looking for the first corresponding items that are not ==.
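That bookkeeping can be illustrated with a simplified re-implementation (a sketch of the idea only; CPython's actual heapq.merge also handles a key function and reverse order, and stores the __next__ method rather than the iterator):

```python
import heapq

def simple_merge(*iterables):
    """Sketch of heapq.merge: a heap of [value, order, iterator] lists."""
    h = []
    for order, it in enumerate(map(iter, iterables)):
        for value in it:
            # 'order' breaks value ties, so 'it' itself is never compared.
            h.append([value, order, it])
            break
    heapq.heapify(h)
    while h:
        value, order, it = entry = h[0]
        yield value
        for value in it:              # advance this iterator, if not empty
            entry[0] = value
            heapq.heapreplace(h, entry)
            break
        else:                         # iterator exhausted: drop its entry
            heapq.heappop(h)

print(list(simple_merge([1, 3, 5], [2, 4], [0, 6])))
```

Every sift inside heapify/heapreplace compares these [value, order, iterator] lists with <, which is exactly where the list-comparison behavior described above kicks in.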
So, e.g., asking whether [0] < [1] first asks whether 0 == 1. It's not, so then it goes on to ask whether 0 < 1.
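You can watch exactly that happen with a tiny instrumented class (names here are illustrative):

```python
class Spy:
    """Record every comparison the interpreter asks for."""
    eq_calls = 0
    lt_calls = 0
    def __init__(self, val):
        self.val = val
    def __eq__(self, other):
        Spy.eq_calls += 1
        return self.val == other.val
    def __lt__(self, other):
        Spy.lt_calls += 1
        return self.val < other.val

# List comparison probes the items with == first, then decides with <.
result = [Spy(0)] < [Spy(1)]
print(result, Spy.eq_calls, Spy.lt_calls)  # True 1 1
```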
Because of this, each < comparison done during the execution of heapq.merge() actually does two object comparisons (one ==, the other <). The == comparisons are "wasted" work, in the sense that they're not logically necessary to solve the problem - they're just "an optimization" (which happens not to pay in this context!) used internally by list comparison.
So in some sense it would be fairer to cut the report of heapq.merge() comparisons in half. But it's still way more than sorted() needed, so I'll let it drop now ;-)
sorted uses an adaptive mergesort that detects sorted runs and merges them efficiently, so it gets to take advantage of all the same structure in the input that heapq.merge gets to use. Also, sorted has a really nice C implementation with a lot more optimization effort put into it than heapq.merge.