
Efficiency of sorting by multiple keys in Python

I have a list of strings that I want to sort by two custom key functions in Python 3.6. Comparing the multi-sort approach (sorting first by the lesser key, then by the major key) with the multi-key approach (sorting once with the tuple (major_key, lesser_key) as the key), I found the latter to be more than 2x slower than the former, which surprised me since I thought the two were equivalent. I would like to understand why this is so.

import random
from time import time

largest = 1000000
length = 10000000
start = time()
lst = [str(x) for x in random.choices(range(largest), k=length)]
t0 = time() - start

start = time()
tmp = sorted(lst, key=lambda x: x[::2])
l1 = sorted(tmp, key=lambda x: ''.join(sorted(x)))
t1 = time() - start

start = time()
l2 = sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
t2 = time() - start

print(f'prepare={t0} multisort={t1} multikey={t2} slowdown={t2/t1}')

assert l1 == l2
asked Oct 06 '21 by o17t H1H' S'k


2 Answers

Here's a third way to time:

start = time()
l3 = sorted(lst, key=lambda x: (''.join(sorted(x)) + "/" + x[::2]))
t3 = time() - start

and expanding the last line to

assert l1 == l2 == l3

This uses a single string as the key, combining the two strings you view as the "primary" and "secondary" keys. Note that:

>>> chr(ord("0") - 1)
'/'

That's why the two keys can be combined - they're separated by an ASCII character that compares "less than" any ASCII digit (of course this is wholly specific to the precise kind of keys you're using).
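
A quick sanity check of my own (not part of the original answer): for these digit-only strings, sorting by the joined key gives exactly the same order as sorting by the tuple key, precisely because '/' sorts below every digit.

import random

# Hypothetical check: the '/'-joined key orders digit strings exactly like
# the (primary, secondary) tuple key does, since '/' < '0' in ASCII.
samples = [str(x) for x in random.choices(range(1000000), k=1000)]

def tuple_key(x):
    return (''.join(sorted(x)), x[::2])

def joined_key(x):
    return ''.join(sorted(x)) + "/" + x[::2]

assert sorted(samples, key=tuple_key) == sorted(samples, key=joined_key)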

This one-key variant is typically a little faster than the multisort approach for me, using the precise program you posted:

prepare=3.628943920135498 multisort=15.646344423294067 multikey=34.255955934524536 slowdown=2.1893903782103075 onekey=15.11461067199707

I believe the primary reason "why" is briefly explained at the end of a modern CPython distribution's Objects/listsort.txt:

As noted above, even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls. This can be partially mitigated by pre-scanning the data to determine whether the data is homogeneous with respect to type. If so, it is sometimes possible to substitute faster type-specific comparisons for the slower, generic PyObject_RichCompareBool.

When there's a single string used as the key, this pre-sorting scan deduces that all the keys in the list are in fact strings, so all the runtime expense of figuring out which comparison function to call can be skipped: sorting can always call the string-specific comparison function instead of the all-purpose (and significantly more expensive) PyObject_RichCompareBool.

multisort() also benefits from that optimization.

But multikey() doesn't, much. The pre-sorting scan sees that all the keys are tuples, but the tuple comparison function itself can't assume anything about the types of the tuple's elements: it has to resort to PyObject_RichCompareBool every time it's invoked. (Note: as touched on in comments, it's not really quite that simple: some optimization is still done exploiting that the keys are all tuples, but it doesn't always pay off, and at best is less effective - and see the next section for clearer evidence of that.)

Focus

There's a whole lot going on in the test case, which leads to needing ever-greater effort to explain ever-smaller distinctions.

So to look at the effects of the type homogeneity optimizations, let's simplify things a lot: no key function at all. Like so:

from random import random, seed
from time import time

length = 10000000
seed(1234567891)
xs = [random() for _ in range(length)]

ys = xs[:]
start = time()
ys.sort()
e1 = time() - start

ys = [(x,) for x in xs]
start = time()
ys.sort()
e2 = time() - start

ys = [[x] for x in xs]
start = time()
ys.sort()
e3 = time() - start
print(e1, e2, e3)

Here's typical output on my box:

3.1991195678710938 12.756590843200684 26.31903386116028

So it's by far fastest to sort floats directly. It's already very damaging to stick the floats in 1-tuples, but the optimization still gives highly significant benefit: it takes over twice as long again to stick the floats in singleton lists. In that last case (and only in that last case), PyObject_RichCompareBool is always called.

answered Oct 21 '22 by Tim Peters


Update

Tim and a_guest nudged me to analyze more. Here are counts of the different comparisons (per element) and other statistics:

               length:     10,000       100,000      1,000,000     10,000,000
                      | msort  mkey | msort  mkey | msort  mkey | msort  mkey |
----------------------+-------------+-------------+-------------+-------------+
 < ''.join(sorted(x)) | 11.64 11.01 | 13.86 11.88 | 13.10 12.00 | 12.16 12.06 |
 < x[::2]             | 11.99  0.96 | 13.51  3.20 | 13.77  5.35 | 13.79  6.68 |
== ''.join(sorted(x)) |       11.99 |       15.29 |       18.60 |       21.26 |
== x[::2]             |        0.98 |        3.42 |        6.60 |        9.20 |
----------------------+-------------+-------------+-------------+-------------+
time, μs per element  |  0.84  1.03 |  0.95  1.42 |  1.31  2.35 |  1.39  3.15 |
----------------------+-------------+-------------+-------------+-------------+
tracemalloc peak MiB  |  0.82  1.67 |  8.26 17.71 |  82.7 178.0 |   826  1780 |
----------------------+-------------+-------------+-------------+-------------+
sys.getsizeof MiB     |  0.58  1.80 |  5.72 17.82 |  57.6 179.5 |   581  1808 |
                      |  0.60       |  6.00       |  60.4       |   608       |
----------------------+-------------+-------------+-------------+-------------+

Numbers of comparisons

The numbers of comparisons are per element, i.e., I counted all comparisons and divided by the list size. So for example for a list with a million elements, multisort did 13.77 < comparisons per x[::2] string and then 13.10 < comparisons per ''.join(sorted(x)) string. Whereas multikey has many == comparisons of the strings and fewer < comparisons of the strings. As Tim pointed out, unsafe_tuple_compare uses the slower PyObject_RichCompareBool(..., Py_EQ) before it uses the faster unsafe_latin_compare (on the ''.join(sorted(x)) strings) or another slower PyObject_RichCompareBool(..., Py_LT) (on the x[::2] strings).
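
To make the comparison order concrete, here is a rough pure-Python picture (my own sketch, not CPython's actual C code) of what a "tuple less than" test does during this sort: it asks == of each element pair first, and only asks < of the first pair that turns out unequal.

def tuple_less_than(t1, t2):
    # Sketch only: equality is tested element by element, and '<' is asked
    # only of the first pair that is not equal.
    for a, b in zip(t1, t2):
        if not a == b:        # an extra == comparison for every element visited
            return a < b      # a < comparison only for the deciding element
    return len(t1) < len(t2)  # all shared elements equal: the shorter tuple is smaller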

A key thing to note is that multisort's numbers of comparisons per element stay roughly constant, and it only uses the faster unsafe_latin_compare. Multikey's counts, on the other hand, grow rapidly for everything except the < comparisons on the ''.join(sorted(x)) strings, and they include the additional, slower equality comparisons.

That has to do with your data, as x is the string representation of an integer from 0 to 1,000,000. That causes more and more duplicates, both for ''.join(sorted(x)) (which also has duplicates from the sorting, e.g., 314159 and 911345 both become '113459') and for x[::2] (which also has duplicates from the slicing, e.g., 123456 and 183957 both become '135').

For multisort this is good, as duplicates mean less work for it - equal things don't need to be "sorted further" (and I think streaks of equal things are good for Timsort's galloping).

But multikey suffers from the duplicates, because the == comparisons on ''.join(sorted(x)) result in "they're equal" more often, leading to more comparisons of the x[::2] strings. And these x[::2] strings often turn out unequal: note how the numbers for < x[::2] are relatively large compared to the == x[::2] comparisons (e.g., at ten million elements, there were 9.20 == comparisons and 6.68 < comparisons, so 73% of the time they were unequal). Which means the tuples also often turn out unequal, so they do need to be "sorted further" (and I think it's bad for galloping). And this further sorting will compare whole tuples again, meaning even more == comparisons of the ''.join(sorted(x)) strings even though they're equal! That's why in multikey, the < comparisons of the ''.join(sorted(x)) strings remain fairly stable (from 11.01 per string for 10,000 elements, up to only 12.06 for 10 million elements) while their == comparisons grow so much (from 11.99 up to 21.26).
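
You can quantify those duplicates directly. A quick check of my own, with data generated like in the question:

import random

length = 1_000_000
lst = [str(x) for x in random.choices(range(1_000_000), k=length)]

primary = [''.join(sorted(x)) for x in lst]   # collisions: the digit order is thrown away
secondary = [x[::2] for x in lst]             # collisions: every other digit is dropped

print('elements:               ', length)
print('distinct primary keys:  ', len(set(primary)))
print('distinct secondary keys:', len(set(secondary)))
print('distinct tuple keys:    ', len(set(zip(primary, secondary))))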

Times and Memory

The runtimes in the above table also reflect the small growth for multisort and the large growth for multikey. The one place where multisort slowed down noticeably was the step from 100,000 to 1,000,000 elements, going from 0.95 μs to 1.31 μs per element. As the tracemalloc and sys.getsizeof rows of my table show, its memory stepped up from ~6 MiB to ~59 MiB, and my CPU has about 34 MiB of cache, so cache misses might be responsible for that.

For the sys.getsizeof values I put all keys into a list and added the size of the list (since sorted also stores the keys) plus the sizes of all strings/tuples in that list. For multisort, my table shows two values separately, as the two sorts happen one after the other and the memory for the first sort's keys is released before the second sort.
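
A rough reconstruction of that measurement (my own helper name; I'm assuming the strings inside the tuple keys are counted as well):

import random
import sys

def keys_size_mib(keys):
    # Approximate memory of a key list: the list object plus every key object,
    # and for tuple keys also the strings stored inside each tuple.
    total = sys.getsizeof(keys) + sum(sys.getsizeof(k) for k in keys)
    if keys and isinstance(keys[0], tuple):
        total += sum(sys.getsizeof(s) for k in keys for s in k)
    return total / 2**20

lst = [str(x) for x in random.choices(range(1_000_000), k=100_000)]
print(keys_size_mib([x[::2] for x in lst]))                        # multisort, first sort
print(keys_size_mib([''.join(sorted(x)) for x in lst]))            # multisort, second sort
print(keys_size_mib([(''.join(sorted(x)), x[::2]) for x in lst]))  # multikey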

(Note: runtimes differ from my original answer as I had to use a different computer to sort ten million elements.)

Code for counting comparisons

I wrap the strings in an S object that increments a counter for each comparison.

import random
from collections import Counter

class S:
    def __init__(self, value, label):
        self.value = value
        self.label = label
    def __eq__(self, other):
        comparisons['==', self.label] += 1
        return self.value == other.value
    def __ne__(self, other):
        comparisons['!=', self.label] += 1
        return self.value != other.value
    def __lt__(self, other):
        comparisons['<', self.label] += 1
        return self.value < other.value
    def __le__(self, other):
        comparisons['<=', self.label] += 1
        return self.value <= other.value
    def __ge__(self, other):
        comparisons['>=', self.label] += 1
        return self.value >= other.value
    def __gt__(self, other):
        comparisons['>', self.label] += 1
        return self.value > other.value

def key1(x):
    return S(''.join(sorted(x)), "''.join(sorted(x))")

def key2(x):
    return S(x[::2], 'x[::2]')

def multisort(l):
    tmp = sorted(l, key=lambda x: key2(x))
    return sorted(tmp, key=lambda x: key1(x))

def multikey(l):
    return sorted(l, key=lambda x: (key1(x), key2(x)))

funcs = [
    multisort,
    multikey,
]

def test(length, rounds):
    print(f'{length = :,}')

    largest = 1000000

    for _ in range(3):
        lst = list(map(str, random.choices(range(largest + 1), k=length)))
        for func in funcs:
            global comparisons
            comparisons = Counter()
            func(lst)
            print(func.__name__)
            for comparison, frequency in comparisons.most_common():
                print(f'{frequency / length:5.2f}', *comparison)
            print()

test(10_000, 1)
test(100_000, 10)
test(1_000_000, 1)
test(10_000_000, 1)

Original answer

Yeah, I've often used / pointed out that two simpler sorts can be faster than one more complex sort. Here are benchmarks with a few more versions:

At length = 10,000 elements, multikey took about 1.16 times as long as multisort:

multisort           12.09 ms  12.21 ms  12.32 ms  (no previous result)
multikey            14.13 ms  14.14 ms  14.14 ms  (same as previous result)
multisort_tupled    15.40 ms  15.61 ms  15.70 ms  (same as previous result)
multisort_inplaced  11.46 ms  11.49 ms  11.49 ms  (same as previous result)

At length = 100,000, it took about 1.43 times as long:

length = 100,000
multisort           0.162 s  0.164 s  0.164 s  (no previous result)
multikey            0.231 s  0.233 s  0.237 s  (same as previous result)
multisort_tupled    0.222 s  0.225 s  0.227 s  (same as previous result)
multisort_inplaced  0.156 s  0.157 s  0.158 s  (same as previous result)

At length = 1,000,000, it took about 2.15 times as long:

multisort           1.894 s  1.897 s  1.898 s  (no previous result)
multikey            4.026 s  4.060 s  4.163 s  (same as previous result)
multisort_tupled    2.734 s  2.765 s  2.771 s  (same as previous result)
multisort_inplaced  1.840 s  1.841 s  1.863 s  (same as previous result)

Reasons I see:

  • Building tuples costs extra time, and comparing tuples of things is slower than comparing just those things. See the times of multisort_tupled, where in one of the two sorts I wrap each real key in a single-value tuple. That made it much slower.
  • With large data, CPU/memory caches play a more significant role. Having tuples with two things per tuple means three times as many objects in memory as just having one of those things, which leads to more cache misses, or even to (more) swap file usage if you go really big.
  • Tuples are registered with the reference cycle garbage collector, which can play a significant role. We don't produce reference cycles here, so we can disable it during the sorting (a minimal sketch of that follows this list), which speeds it up a bit. Edit: I still think it should be at least slightly faster, but with more testing I got conflicting times regarding this, so I took this out.
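
If you want to try the garbage-collector angle yourself, a minimal sketch (my own wrapper, not part of the benchmark code below):

import gc

def sorted_without_gc(l, key):
    # Pause the cyclic garbage collector around the sort; no cycles are created here.
    gc.disable()
    try:
        return sorted(l, key=key)
    finally:
        gc.enable()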

Btw, notice my multisort_inplaced is a bit faster still. If you do multiple sorts, it's pointless to do them all with sorted. After the first one, just use inplace sort. No reason to create further new lists, which takes time/memory for the lists and takes time to update the reference counts for all the list elements.

Benchmark code (Try it online!):

import random
from time import perf_counter as timer

def multisort(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: ''.join(sorted(x)))

def multikey(l):
    return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))

def multisort_tupled(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: (''.join(sorted(x)),))

def multisort_inplaced(l):
    tmp = sorted(l, key=lambda x: x[::2])
    tmp.sort(key=lambda x: ''.join(sorted(x)))
    return tmp

funcs = [
    multisort,
    multikey,
    multisort_tupled,
    multisort_inplaced,
]

def test(length, rounds):
    print(f'{length = :,}')

    largest = 1000000
    lst = list(map(str, random.choices(range(largest + 1), k=length)))

    prev = None
    for func in funcs:
        times = []
        for _ in range(5):
            time = 0
            for _ in range(rounds):
                copy = lst.copy()
                start = timer()
                result = func(copy)
                time += timer() - start
            times.append(time / rounds)
        print('%-19s' % func.__name__,
              *('%5.2f ms ' % (t * 1e3) for t in sorted(times)[:3]),
              # *('%5.3f s ' % t for t in sorted(times)[:3]),
              '(%s previous result)' % ('no' if prev is None else 'same as' if result == prev else 'differs from'))
        prev = result

test(10_000, 10)
# test(100_000, 10)
# test(1_000_000, 1)
answered Oct 21 '22 by Stefan Pochmann