I've been creating huge dicts (millions of entries), and I've noticed that creating them with the keys in order is much faster.
I imagine it has something to do with collisions in the hash function, but can someone explain why this happens and whether it is consistent across versions of Python?
Here is an artificial example:
import timeit
import random
def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        otest = [str(i) for i in range(size)]
        utest = list(otest)
        random.shuffle(utest)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist
NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)
def test_ordered():
    dict((k, k) for k in ordered.pop())

def test_unordered():
    dict((k, k) for k in unordered.pop())
print "unordered: ",
print timeit.timeit("test_unordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
The output on my machine is consistently:
(X)$ python /tmp/test.py
unordered: 8.60760807991
ordered: 5.1214389801
I'm using Python 2.7.3 on Ubuntu Precise x86_64.
I'm almost certain this is what's going on: when you first create otest, the strings are stored in order in memory. When you create utest, the strings point to the same memory buffers, except that now those locations are out of order, which kills cache performance on the unordered test cases.
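This can be probed directly with a CPython-specific sketch (an assumption: `id()` returning a memory address is an implementation detail, not a language guarantee, and small cached strings may break the pattern). Both lists hold the very same string objects; only the traversal order of their addresses differs:

```python
import random

# Build strings consecutively, then shuffle a second list of the same objects.
otest = [str(i) for i in range(1000)]
utest = list(otest)
random.shuffle(utest)

addrs_ordered = [id(s) for s in otest]    # traversal order of the ordered test
addrs_shuffled = [id(s) for s in utest]   # same objects, scrambled traversal

# Same set of objects either way...
print(sorted(addrs_ordered) == sorted(addrs_shuffled))  # -> True
# ...but the ordered traversal tends to visit mostly increasing addresses
# (cache-friendly), while the shuffled one jumps around. Not guaranteed:
increasing_steps = sum(b > a for a, b in zip(addrs_ordered, addrs_ordered[1:]))
print(increasing_steps)  # often close to 999 on CPython, but no promises
```

The second number is only indicative; the allocator is free to place objects however it likes.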
Here's the evidence. I've replaced your get_test_data function with this version:
def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        nums = range(size)
        random.shuffle(nums)
        utest = [str(i) for i in nums]
        otest = list(utest)
        otest.sort(key=lambda x: int(x))
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist
The idea is that I'm now constructing the strings in ulist consecutively in memory, then building olist by sorting those strings with the appropriate key. On my machine, this reverses the running times of the two tests.
Checking the source code of the Python dict, you can see that consecutive strings or ints produce fewer collisions. Combined with @skishore's comment about better cache locality, this could be the answer.
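A rough way to see the collision effect is to count how often a key's initial slot is already occupied. This is a hedged sketch, not the real dict algorithm: it only models the first probe, `hash(key) & mask`, and ignores CPython's perturbed probing on subsequent attempts; the table size and fill ratio are illustrative assumptions.

```python
import random

def first_probe_collisions(keys, table_bits=13):
    """Count keys whose initial slot (hash & mask) is already taken.

    Mimics only the first probe of open addressing; the real CPython dict
    perturbs later probes, so this is just an approximation.
    """
    mask = (1 << table_bits) - 1
    seen = set()
    collisions = 0
    for k in keys:
        slot = hash(k) & mask
        if slot in seen:
            collisions += 1
        else:
            seen.add(slot)
    return collisions

n = 4096  # fill half of a 2**13-slot table, roughly how sparse dicts stay
consecutive = list(range(n))
scattered = random.sample(range(10**9), n)

print(first_probe_collisions(consecutive))  # -> 0: contiguous ints never collide
print(first_probe_collisions(scattered))    # typically several hundred
```

Consecutive int keys each claim a distinct slot on the first try, while scattered keys frequently share low-order bits and must probe again.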
Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>
This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
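To make the "no collisions at all" claim concrete, here is a minimal sketch assuming CPython, where `hash(n) == n` for small non-negative ints: taking the low-order bits of consecutive int keys assigns each key its own slot.

```python
# In a table of size 2**3 == 8, the initial index is the low-order 3 bits
# of the hash; for keys 0..7 that is the key itself, so every key gets a
# distinct slot and no probing is ever needed.
table_size = 8
slots = [hash(i) % table_size for i in range(table_size)]
print(slots)  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```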