Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n = 10000 of sorted lists of length k = 100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to merge them with heapq.merge(), conceptually a tree of pairwise merges of depth log(n) over the O(nk) total elements, i.e. O(nk log n) work, than to sort the entire thing at once with sorted() in O(nk log(nk)) time.

However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() so much faster than heapq.merge()'s that it outstrips the asymptotic advantage of the classic merge?

import itertools
import heapq

data = [range(n*8000,n*8000+10000,100) for n in range(10000)]

# Approach 1
for val in heapq.merge(*data):
    test = val

# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
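For reference, a small self-contained harness along these lines (mine, not the asker's; scaled down to n = 500 lists so it runs quickly) can be used to time the two approaches side by side:

```python
import heapq
import itertools
import timeit

# Smaller-scale version of the setup above (n=500 lists instead of
# 10000); the shape of the data is otherwise the same.
data = [list(range(n * 8000, n * 8000 + 10000, 100)) for n in range(500)]

t_merge = timeit.timeit(lambda: list(heapq.merge(*data)), number=3)
t_sort = timeit.timeit(lambda: sorted(itertools.chain(*data)), number=3)

print(f"heapq.merge: {t_merge:.3f}s   sorted: {t_sort:.3f}s")
```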
asked Jul 12 '16 by Katie

2 Answers

CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.

So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
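For example (a sketch of mine, not from the answer), pulling just the first few merged values with itertools.islice() never materializes the full result:

```python
import heapq
import itertools

# heapq.merge() is lazy: taking the first 5 items of the merged
# stream touches only a handful of elements, while sorted() would
# have to build the entire combined list first.
data = [range(n, 1000 + n, 2) for n in range(10)]
first_five = list(itertools.islice(heapq.merge(*data), 5))
print(first_five)
```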

In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)

(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)

ABOUT THAT "more advantage"

As discussed a bit in comments, list.sort() plays lots of engineering tricks that may cut the number of comparisons needed over what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):

class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__

sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.
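As an aside, the "< only" claim is easy to check with a minimal sketch (mine, not from the answer): a class that defines nothing but __lt__ still sorts fine.

```python
# list.sort()/sorted() only ever call __lt__, so defining just
# __lt__ is enough to make a class sortable.
class OnlyLt:
    def __init__(self, val):
        self.val = val
    def __lt__(self, other):
        return self.val < other.val

xs = sorted(OnlyLt(v) for v in (3, 1, 2))
print([x.val for x in xs])  # [1, 2, 3]
```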

Then changing your data to use instances of that class:

data = [[V(i) for i in range(n*8000,n*8000+10000,100)]
        for n in range(10000)]

Then run both methods:

ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))

The output is kinda remarkable:

43,207,638
1,639,884

So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.

LONG STORY SHORT

Those comparison counts looked too remarkable to me ;-) The count for heapq.merge() looked about twice as large as I thought reasonable.

Took a while to track this down. In short, it's an artifact of the way heapq.merge() is implemented: it maintains a heap of 3-element list objects of the form [next_value, iterable_index, iterable.__next__], where the 0-based index of the iterable breaks comparison ties. The heapq functions compare these little lists (instead of just the iterables' values), and list comparison always goes through the lists first looking for the first corresponding items that are not ==.

So, e.g., asking whether [0] < [1] first asks whether 0 == 1. It's not, so then it goes on to ask whether 0 < 1.
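That's easy to see with a tiny probe class (my sketch, not from the answer) that records which comparison methods list comparison actually invokes:

```python
# List comparison first probes the leading elements with == and only
# then applies < to the first unequal pair; both methods get called.
calls = []

class Probe:
    def __init__(self, val):
        self.val = val
    def __eq__(self, other):
        calls.append("==")
        return self.val == other.val
    def __lt__(self, other):
        calls.append("<")
        return self.val < other.val

result = [Probe(0)] < [Probe(1)]
print(result, calls)  # True ['==', '<']
```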

Because of this, each < comparison done during the execution of heapq.merge() actually does two object comparisons (one ==, the other <). The == comparisons are "wasted" work, in the sense that they're not logically necessary to solve the problem - they're just "an optimization" (which happens not to pay in this context!) used internally by list comparison.

So in some sense it would be fairer to cut the report of heapq.merge() comparisons in half. But it's still way more than sorted() needed, so I'll let it drop now ;-)

answered Nov 03 '22 by Tim Peters

sorted uses an adaptive mergesort that detects sorted runs and merges them efficiently, so it gets to take advantage of all the same structure in the input that heapq.merge gets to use. Also, sorted has a really nice C implementation with a lot more optimization effort put into it than heapq.merge.

answered Nov 03 '22 by user2357112 supports Monica