I have a list of strings that I want to sort by two custom key functions in Python 3.6. Comparing the multi-sort approach (sorting by the secondary key and then by the primary key) to the multi-key approach (sorting once with the tuple (primary_key, secondary_key) as the key), I found the latter to be more than 2x slower than the former, which was a surprise as I thought they were equivalent. I would like to understand why this is so.
import random
from time import time
largest = 1000000
length = 10000000
start = time()
lst = [str(x) for x in random.choices(range(largest), k=length)]
t0 = time() - start
start = time()
tmp = sorted(lst, key=lambda x: x[::2])
l1 = sorted(tmp, key=lambda x: ''.join(sorted(x)))
t1 = time() - start
start = time()
l2 = sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
t2 = time() - start
print(f'prepare={t0} multisort={t1} multikey={t2} slowdown={t2/t1}')
assert l1 == l2
Here's a third way to time:
start = time()
l3 = sorted(lst, key=lambda x: (''.join(sorted(x)) + "/" + x[::2]))
t3 = time() - start
and expanding the last line to
assert l1 == l2 == l3
This uses a single string as the key, but it combines the two string keys you view as being the "primary" and "secondary" keys. Note that:
>>> chr(ord("0") - 1)
'/'
That's why the two keys can be combined - they're separated by an ASCII character that compares "less than" any ASCII digit (of course this is wholly specific to the precise kind of keys you're using).
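To see why that separator matters, here's a tiny illustrative check with made-up key values (not from the program above): with '/' as the separator, the concatenated string orders pairs the same way the tuple key would, while plain concatenation can flip the order when the primary keys have different lengths.
a = ('12', '9')    # (primary, secondary)
b = ('123', '4')
assert (a < b) == ('12/9' < '123/4')   # '/' sorts before every digit, so ties break correctly
assert ('129' < '1234') != (a < b)     # without the separator, the order flips here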
This single-string key is typically a little faster than multisort() for me, using the precise program you posted:
prepare=3.628943920135498 multisort=15.646344423294067 multikey=34.255955934524536 slowdown=2.1893903782103075 onekey=15.11461067199707
I believe the primary reason "why" is briefly explained at the end of a modern CPython distribution's Objects/listsort.txt:
As noted above, even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls. This can be partially mitigated by pre-scanning the data to determine whether the data is homogeneous with respect to type. If so, it is sometimes possible to substitute faster type-specific comparisons for the slower, generic PyObject_RichCompareBool.
When there's a single string used as the key, this pre-sorting scan deduces that all the keys in the list are in fact strings, so all the runtime expense of figuring out which comparison function to call can be skipped: sorting can always call the string-specific comparison function instead of the all-purpose (and significantly more expensive) PyObject_RichCompareBool.
multisort() also benefits from that optimization.
But multikey() doesn't, much. The pre-sorting scan sees that all the keys are tuples, but the tuple comparison function itself can't assume anything about the types of the tuple's elements: it has to resort to PyObject_RichCompareBool every time it's invoked. (Note: as touched on in comments, it's not really quite that simple: some optimization is still done exploiting that the keys are all tuples, but it doesn't always pay off, and at best is less effective - see the next section for clearer evidence of that.)
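To make that concrete, here's a rough Python-level paraphrase of the pre-scan idea described in listsort.txt (a sketch only, not CPython's actual C code): if every key has exactly the same type, a type-specific comparison can be chosen once up front; otherwise the generic path is used for every comparison.
def choose_lt(keys):
    # Sketch of the idea from listsort.txt; CPython does this in C
    # (Objects/listobject.c) with more checks, e.g. all strings must be
    # latin-1 before unsafe_latin_compare can be used.
    key_type = type(keys[0])
    if any(type(k) is not key_type for k in keys):
        return lambda a, b: a < b  # mixed types: generic comparison every time
    if key_type is str:
        return str.__lt__          # stands in for the string-specific comparison
    if key_type is float:
        return float.__lt__        # stands in for the float-specific comparison
    # Homogeneous tuples get unsafe_tuple_compare in CPython, which still has to
    # dispatch per element; in this sketch we just fall back to the generic path.
    return lambda a, b: a < b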
There's a whole lot going on in the test case, which leads to needing ever-greater effort to explain ever-smaller distinctions.
So to look at the effects of the type homogeneity optimizations, let's simplify things a lot: no key function at all. Like so:
from random import random, seed
from time import time
length = 10000000
seed(1234567891)
xs = [random() for _ in range(length)]
ys = xs[:]
start = time()
ys.sort()
e1 = time() - start
ys = [(x,) for x in xs]
start = time()
ys.sort()
e2 = time() - start
ys = [[x] for x in xs]
start = time()
ys.sort()
e3 = time() - start
print(e1, e2, e3)
Here's typical output on my box:
3.1991195678710938 12.756590843200684 26.31903386116028
So it's by far fastest to sort floats directly. It's already very damaging to stick the floats in 1-tuples, but the optimization still gives a highly significant benefit: it takes over twice as long again to stick the floats in singleton lists. In that last case (and only in that last case), PyObject_RichCompareBool is always called.
Tim and a_guest nudged me to analyze more. Here are counts of the different comparisons (per element) and other statistics:
              length: |      10,000 |     100,000 |   1,000,000 |  10,000,000 |
                      | msort  mkey | msort  mkey | msort  mkey | msort  mkey |
----------------------+-------------+-------------+-------------+-------------+
< ''.join(sorted(x))  | 11.64 11.01 | 13.86 11.88 | 13.10 12.00 | 12.16 12.06 |
< x[::2]              | 11.99  0.96 | 13.51  3.20 | 13.77  5.35 | 13.79  6.68 |
== ''.join(sorted(x)) |       11.99 |       15.29 |       18.60 |       21.26 |
== x[::2]             |        0.98 |        3.42 |        6.60 |        9.20 |
----------------------+-------------+-------------+-------------+-------------+
time, μs per element  |  0.84  1.03 |  0.95  1.42 |  1.31  2.35 |  1.39  3.15 |
----------------------+-------------+-------------+-------------+-------------+
tracemalloc peak MiB  |  0.82  1.67 |  8.26 17.71 |  82.7 178.0 |   826  1780 |
----------------------+-------------+-------------+-------------+-------------+
sys.getsizeof MiB     |  0.58  1.80 |  5.72 17.82 |  57.6 179.5 |   581  1808 |
                      |  0.60       |  6.00       |  60.4       |   608       |
----------------------+-------------+-------------+-------------+-------------+
The numbers of comparisons are per element, i.e., I counted all comparisons and divided by the list size. So for example for a list with a million elements, multisort did 13.77 < comparisons per x[::2] string and then 13.10 < comparisons per ''.join(sorted(x)) string. Whereas multikey has many == comparisons of the strings and fewer < comparisons of the strings. As Tim pointed out, unsafe_tuple_compare uses the slower PyObject_RichCompareBool(..., Py_EQ) before it uses the faster unsafe_latin_compare (on the ''.join(sorted(x)) strings) or another slower PyObject_RichCompareBool(..., Py_LT) (on the x[::2] strings).
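In Python-like pseudocode, that logic looks roughly like the following (a simplified paraphrase of the C code in Objects/listobject.c for the CPython versions discussed here; elem_lt stands for the specialized first-element comparison such as unsafe_latin_compare):
def unsafe_tuple_compare_sketch(v, w, elem_lt):
    # Walk the tuples with generic equality checks (the Py_EQ cost noted above)
    # until the first position where they differ.
    for i in range(min(len(v), len(w))):
        if not (v[i] == w[i]):
            break
    else:
        return len(v) < len(w)       # one tuple is a prefix of the other
    if i == 0:
        return elem_lt(v[0], w[0])   # specialized compare, e.g. unsafe_latin_compare
    return v[i] < w[i]               # generic compare again (Py_LT)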
A key thing to note is that multisort's numbers of comparisons per element stay roughly constant and it only uses the faster unsafe_latin_compare. Whereas multikey's numbers of comparisons other than the < comparisons on ''.join(sorted(x)) grow rapidly, including the additional slower equality comparisons.
That has to do with your data, as x is the string representation of an integer from 0 to 1,000,000. That causes more and more duplicates, both for ''.join(sorted(x)) (which also has duplicates from the sorting, e.g., 314159 and 911345 both become '113459') and for x[::2] (which also has duplicates from the slicing, e.g., 123456 and 183957 both become '135').
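A quick way to see how many duplicates that produces (a small side check, not part of the measurements above):
import random

lst = [str(x) for x in random.choices(range(1000000), k=1000000)]
print(len(set(lst)))                              # distinct original strings
print(len(set(''.join(sorted(x)) for x in lst)))  # far fewer distinct sorted-digit keys
print(len(set(x[::2] for x in lst)))              # even fewer distinct sliced keys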
For multisort this is good, as duplicates mean less work for it - equal things don't need to be "sorted further" (and I think streaks of equal things are good for Timsort's galloping). But multikey suffers from the duplicates, because the == comparisons on ''.join(sorted(x)) result in "they're equal" more often, leading to more comparisons of the x[::2] strings. And these x[::2] often turn out unequal; note how the numbers for < x[::2] are relatively large compared to the == x[::2] comparisons (e.g., at ten million elements, there were 9.20 == comparisons and 6.68 < comparisons, so 73% of the time they were unequal). Which means the tuples also often turn out unequal, so they do need to be "sorted further" (and I think that's bad for galloping). And this further sorting will compare whole tuples again, also meaning even more == comparisons of the ''.join(sorted(x)) strings even though they're equal! That's why in multikey, the < comparisons of the ''.join(sorted(x)) strings remain fairly stable (from 11.01 per string for 10,000 elements, up to only 12.06 for 10 million elements) while their == comparisons grow so much (from 11.99 up to 21.26).
The runtimes in the above table also reflect the small growth for multisort and the large growth for multikey. The one time multisort got quite a bit slower was at the step from 100,000 to 1,000,000 elements, going from 0.95 μs to 1.31 μs per element. As can be seen from the tracemalloc and sys.getsizeof rows of my table, its memory stepped up from ~6 MiB to ~59 MiB, and the CPU has about 34 MiB of cache, so cache misses might be responsible for that.
For the sys.getsizeof values I put all keys into a list and add the size of the list (since sorted also stores them) and the sizes of all strings/tuples in the list. For multisort, my table shows two values separately, as the two sorts happen one after the other, and the memory for the first sort is released before the second sort.
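For reference, the idea can be sketched like this (a reconstruction of the approach, not necessarily the exact script used for the table):
import sys

def keys_size_mib(lst, key):
    # Size of the list of key objects plus the key objects themselves;
    # for tuple keys, also count the strings stored inside the tuples.
    keys = [key(x) for x in lst]
    total = sys.getsizeof(keys) + sum(sys.getsizeof(k) for k in keys)
    total += sum(sys.getsizeof(s) for k in keys if isinstance(k, tuple) for s in k)
    return total / 2**20

# e.g. keys_size_mib(lst, lambda x: (''.join(sorted(x)), x[::2]))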
(Note: runtimes differ from my original answer as I had to use a different computer to sort ten million elements.)
I wrap the strings in an S object that increases a counter for each comparison.
import random
from collections import Counter

class S:
    def __init__(self, value, label):
        self.value = value
        self.label = label
    def __eq__(self, other):
        comparisons['==', self.label] += 1
        return self.value == other.value
    def __ne__(self, other):
        comparisons['!=', self.label] += 1
        return self.value != other.value
    def __lt__(self, other):
        comparisons['<', self.label] += 1
        return self.value < other.value
    def __le__(self, other):
        comparisons['<=', self.label] += 1
        return self.value <= other.value
    def __ge__(self, other):
        comparisons['>=', self.label] += 1
        return self.value >= other.value
    def __gt__(self, other):
        comparisons['>', self.label] += 1
        return self.value > other.value

def key1(x):
    return S(''.join(sorted(x)), "''.join(sorted(x))")

def key2(x):
    return S(x[::2], 'x[::2]')

def multisort(l):
    tmp = sorted(l, key=lambda x: key2(x))
    return sorted(tmp, key=lambda x: key1(x))

def multikey(l):
    return sorted(l, key=lambda x: (key1(x), key2(x)))

funcs = [
    multisort,
    multikey,
]

def test(length, rounds):
    print(f'{length = :,}')
    largest = 1000000
    for _ in range(3):
        lst = list(map(str, random.choices(range(largest + 1), k=length)))
        for func in funcs:
            global comparisons
            comparisons = Counter()
            func(lst)
            print(func.__name__)
            for comparison, frequency in comparisons.most_common():
                print(f'{frequency / length:5.2f}', *comparison)
            print()

test(10_000, 1)
test(100_000, 10)
test(1_000_000, 1)
test(10_000_000, 1)
Yeah, I've often used / pointed out that two simpler sorts can be faster than one more complex sort. Here are benchmarks with a few more versions:
At length = 10,000 elements, multikey took about 1.16 times as long as multisort:
multisort           12.09 ms  12.21 ms  12.32 ms  (no previous result)
multikey            14.13 ms  14.14 ms  14.14 ms  (same as previous result)
multisort_tupled    15.40 ms  15.61 ms  15.70 ms  (same as previous result)
multisort_inplaced  11.46 ms  11.49 ms  11.49 ms  (same as previous result)
At length = 100,000, it took about 1.43 times as long:
length = 100,000
multisort           0.162 s  0.164 s  0.164 s  (no previous result)
multikey            0.231 s  0.233 s  0.237 s  (same as previous result)
multisort_tupled    0.222 s  0.225 s  0.227 s  (same as previous result)
multisort_inplaced  0.156 s  0.157 s  0.158 s  (same as previous result)
At length = 1,000,000, it took about 2.15 times as long:
multisort           1.894 s  1.897 s  1.898 s  (no previous result)
multikey            4.026 s  4.060 s  4.163 s  (same as previous result)
multisort_tupled    2.734 s  2.765 s  2.771 s  (same as previous result)
multisort_inplaced  1.840 s  1.841 s  1.863 s  (same as previous result)
Reasons I see: the tuple keys add overhead - they have to be built, and comparing them is slower than comparing the strings directly (as Tim explained). To gauge that cost I added multisort_tupled, where in one of the two sorts I wrap each real key in a single-value tuple. That made it much slower.
Btw, notice my multisort_inplaced is a bit faster still. If you do multiple sorts, it's pointless to do them all with sorted. After the first one, just use in-place sort. No reason to create further new lists, which takes time/memory for the lists and takes time to update the reference counts for all the list elements.
Benchmark code (Try it online!):
import random
from time import perf_counter as timer

def multisort(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: ''.join(sorted(x)))

def multikey(l):
    return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))

def multisort_tupled(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: (''.join(sorted(x)),))

def multisort_inplaced(l):
    tmp = sorted(l, key=lambda x: x[::2])
    tmp.sort(key=lambda x: ''.join(sorted(x)))
    return tmp

funcs = [
    multisort,
    multikey,
    multisort_tupled,
    multisort_inplaced,
]

def test(length, rounds):
    print(f'{length = :,}')
    largest = 1000000
    lst = list(map(str, random.choices(range(largest + 1), k=length)))
    prev = None
    for func in funcs:
        times = []
        for _ in range(5):
            time = 0
            for _ in range(rounds):
                copy = lst.copy()
                start = timer()
                result = func(copy)
                time += timer() - start
            times.append(time / rounds)
        print('%-19s' % func.__name__,
              *('%5.2f ms ' % (t * 1e3) for t in sorted(times)[:3]),
              # *('%5.3f s ' % t for t in sorted(times)[:3]),
              '(%s previous result)' % ('no' if prev is None else 'same as' if result == prev else 'differs from'))
        prev = result

test(10_000, 10)
# test(100_000, 10)
# test(1_000_000, 1)