
Python: Sum the Values of Three Layer Dictionaries

Given a dictionary with three layers of keys, what's the fastest way to sum the values? Here's my current approach:

from collections import defaultdict

dicts = [ {'a':{'b':{'c':1}}}, {'a':{'b':{'c':4, 'e':3}}} ]

def sum_three_deep_dict_values(dicts):
    '''Read in two dicts and return a dictionary that contains their outer-joined keys and value sums'''
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for d in dicts:
        for w1, val_dict in d.iteritems():          
            for w2 in val_dict.iterkeys():              
                for w3 in val_dict[w2].iterkeys():
                    combined[w1][w2][w3] += d[w1][w2][w3]
    return combined

print sum_three_deep_dict_values(dicts)

Here the expected output is {'a': {'b': {'c': 5, 'e': 3}}}. The goal is to sum the values for which both dictionaries have the same keys (such as d['a']['b']['c'] here) and to include the remaining key-value pairs from either dictionary in the output dictionary.
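One note: because the result is built from nested defaultdicts, printing it directly shows the defaultdict wrappers rather than the plain form above; a small recursive helper (a sketch, not part of my code above) strips them back out for display:

def to_plain_dict(d):
    '''Recursively convert nested defaultdicts back into ordinary dicts
       so the printed result matches the expected output above.'''
    if isinstance(d, dict):
        return {k: to_plain_dict(v) for k, v in d.items()}
    return d

print to_plain_dict(sum_three_deep_dict_values(dicts))
# e.g. {'a': {'b': {'c': 5, 'e': 3}}} (key order may vary)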

There are a number of questions on SO that appear to answer the question "How should one sum the values of nested dictionaries?" Reading through them last night, however, every one I found involved some strange special case or extra parameter, like "combine/ignore the n-th layer of keys" or "apply an if condition in a particular place". I therefore wanted to raise the plain question: what's the best way to sum the values of these three-layer dictionaries in Python?

asked May 19 '15 by duhaime

1 Answer

I think your current approach is, in general, a good one. My suggestion would be to eliminate as many dictionary lookups as possible. Iterating over keys and values together should be as fast as iterating over just keys, so you might as well combine them. If you do that, the final d[w1][w2][w3] lookup isn't necessary, nor is the interim val_dict[w2] lookup. So something like this:

def sum_three_deep_dict_values(dicts):
    '''Read in two dicts and return a dictionary that contains 
       their outer-joined keys and value sums'''
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for layer0 in dicts:
        for k1, layer1 in layer0.iteritems():
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    combined[k1][k2][k3] += count
    return combined

I took the liberty of changing your naming scheme slightly.

If you're still worried about speed after testing the above, you may need to look into other data structures or third-party libraries. But before you do that, try PyPy -- I find it often gives at least a 4x speedup on vanilla for loops.
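As one example of the "other data structures" option (just a sketch I haven't benchmarked, and the name sum_three_deep_flat is mine), you could drop the nesting entirely and accumulate into a single flat Counter keyed on (k1, k2, k3) tuples, trading the two inner defaultdict layers for one hash of tuple keys:

from collections import Counter

def sum_three_deep_flat(dicts):
    '''Sum values into a single flat Counter keyed on (k1, k2, k3) tuples.'''
    combined = Counter()
    for layer0 in dicts:
        for k1, layer1 in layer0.iteritems():
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    combined[(k1, k2, k3)] += count
    return combined

Whether that helps depends on whether downstream code can consume the flat (k1, k2, k3) keys directly; if it has to rebuild the nested shape afterwards, the conversion may cancel any gain.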

Also, do test this against your original code. I think my reasoning above holds, but it's still a bit conjectural. I'm curious about others' suggestions too. At the scale you're working, this could be a challenge! (Out of curiosity, how long is this taking you with your current code?)

UPDATE: I tested this and it is indeed faster, although only by a hair:

>>> %timeit sum_three_deep_original(dicts)
1000 loops, best of 3: 1.38 ms per loop
>>> %timeit sum_three_deep_edited(dicts)
1000 loops, best of 3: 1.26 ms per loop

I'm guessing you need more speed for your application. I tried it with PyPy, and I also compiled it using cython (but without any modifications or type annotations). PyPy wins with a 66% speedup. Plain python again (with slightly different parameters this time):

:~ $ python -c 'from tdsum import test; test()'
1.63905096054

Compiled with cython:

:~ $ python -c 'from tdsum import test; test()'
1.224848032

And using PyPy:

:~ $ pypy -c 'from tdsum import test; test()'
0.427165031433
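For anyone who wants to rerun the comparison: the tdsum module isn't reproduced above, but a minimal harness along the following lines works. The dictionary counts, key widths, and repeat count below are placeholders I've made up, not the parameters behind the numbers quoted.

# tdsum.py -- a minimal stand-in for the timing harness used above.
import random
import time
from collections import defaultdict

def sum_three_deep_dict_values(dicts):
    '''The summing function from above, repeated so the module is self-contained.'''
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for layer0 in dicts:
        for k1, layer1 in layer0.iteritems():
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    combined[k1][k2][k3] += count
    return combined

def make_dicts(n_dicts=1000, n_keys=100, width=5):
    '''Build a list of random three-layer dictionaries to sum.'''
    keys = [str(i) for i in range(n_keys)]
    return [{k1: {k2: {k3: random.randint(1, 10)
                       for k3 in random.sample(keys, width)}
                  for k2 in random.sample(keys, width)}
             for k1 in random.sample(keys, width)}
            for _ in range(n_dicts)]

def test():
    '''Time ten summing passes over the same random data and print the total.'''
    dicts = make_dicts()
    start = time.time()
    for _ in range(10):
        sum_three_deep_dict_values(dicts)
    print(time.time() - start)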

I would expect a real cython version using a custom-built data structure to outperform PyPy significantly. The problem is that you can't use dicts and still get the iteration speedup you'd like, because cython has to muck about with Python object overhead. So you'd have to implement your own hash table!

I've often wondered why cython doesn't provide a solution to this problem; perhaps there's a numpy type out there that would be usable. I'll keep looking!

answered Oct 19 '22 by senderle