
Why is inserting keys in order into a Python dict faster than doing it unordered?

I've been creating huge dicts (millions of entries) and I've noticed that if I create them with the keys in order it is much faster.

I imagine it has something to do with hash collisions, but can someone explain why it is happening and whether it is consistent across versions of Python?

Here you have an artificial example:

import timeit
import random

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        otest = [str(i) for i in range(size)]
        utest = list(otest)
        random.shuffle(utest)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)

def test_ordered():
    dict((k, k) for k in ordered.pop())

def test_unordered():
    dict((k, k) for k in unordered.pop())

print "unordered: ",
print timeit.timeit("test_unordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)

The output on my machine is consistently:

(X)$ python /tmp/test.py 
unordered:  8.60760807991
ordered:  5.1214389801

I'm using Python 2.7.3 on Ubuntu Precise x86_64.

asked Aug 14 '13 by barracel

2 Answers

I'm almost certain this is what's going on: when you first create otest, the strings are being stored in order in memory. When you create utest, the strings point to the same memory buffers, except that now those locations are out of order, which kills cache performance on the unordered test cases.

Here's the evidence. I've replaced your get_test_data function with this version:

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        # Create the string objects in shuffled order, so this time it is
        # the "unordered" keys that are laid out consecutively in memory.
        nums = range(size)
        random.shuffle(nums)
        utest = [str(i) for i in nums]
        # The "ordered" list reuses the same string objects, just sorted.
        otest = list(utest)
        otest.sort(key=int)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

The idea is that I'm now constructing the strings in ulist consecutively in memory, then building olist by sorting those strings with the appropriate key. On my machine, this reverses the running times of the two tests.
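If you want to see the memory-layout effect directly, here is a minimal sketch of my own (not part of the original answer). It relies on the CPython implementation detail that id() returns an object's memory address, so we can compare the address order of the two key lists:

import random

otest = [str(i) for i in range(10)]   # strings created (and laid out) in order
utest = list(otest)                   # the very same string objects...
random.shuffle(utest)                 # ...visited in a scrambled order

print [id(s) for s in otest]   # addresses roughly increasing
print [id(s) for s in utest]   # identical addresses, but out of order

Iterating over utest therefore jumps around in memory instead of streaming forward, which is exactly the cache-unfriendly access pattern described above.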

answered Nov 07 '22 by disatisfieddinosaur


Checking the source code of the Python dict, you can see that consecutive strings or ints give fewer collisions. This, combined with @skishore's comment about better cache locality, could be the answer.

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
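To make the slot calculation concrete, here is a small sketch of my own (not from the quoted source) of how the low-order bits of the hash pick the initial slot in a table of size 2**i. The real CPython probing sequence is more involved, but consecutive int keys clearly land in distinct slots:

table_size = 8  # 2**3, so the low-order 3 bits choose the initial slot

for k in range(8):
    print k, hash(k) & (table_size - 1)   # ints: slot == k, no collisions

for k in ("namea", "nameb", "namec", "named"):
    # nearby string hashes typically also spread out
    # (exact values depend on platform and Python version)
    print k, hash(k) & (table_size - 1)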

answered Nov 07 '22 by barracel