Given two dictionaries with three layers of keys each, what's the fastest way to sum their values? Here's my current approach:
from collections import defaultdict

dicts = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 4, 'e': 3}}}]

def sum_three_deep_dict_values(dicts):
    '''Read in two dicts and return a dictionary that contains
    their outer-joined keys and value sums'''
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for d in dicts:
        for w1, val_dict in d.iteritems():
            for w2 in val_dict.iterkeys():
                for w3 in val_dict[w2].iterkeys():
                    combined[w1][w2][w3] += d[w1][w2][w3]
    return combined

print sum_three_deep_dict_values(dicts)
Here the expected output is {'a': {'b': {'c': 5, 'e': 3}}}. The goal is to sum the values wherever both dictionaries have the same keys (such as d['a']['b']['c'] here) and to include the remaining key-value pairs from either dictionary in the output dictionary.
There are a number of questions on SO that appear to answer the question "How should one sum the values of nested dictionaries?" Reading through them last night, however, every one I found involved some strange special case or parameter, like "combine/ignore the n-th layer of keys" or "apply an if condition in a special place". I therefore wanted to raise the plain question: what's the best way to sum the values of triple-nested dictionaries in Python?
I think your current approach is, in general, a good one. My suggestion would be to eliminate as many dictionary lookups as possible. Iterating over keys and values together should be as fast as iterating over just keys, so you might as well combine them. If you do that, the final lookup d[w1][w2][w3] isn't necessary, nor is the interim val_dict[w2] lookup. So something like this:
def sum_three_deep_dict_values(dicts):
    '''Read in two dicts and return a dictionary that contains
    their outer-joined keys and value sums'''
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for layer0 in dicts:
        for k1, layer1 in layer0.iteritems():
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    combined[k1][k2][k3] += count
    return combined
I took the liberty of changing your name scheme slightly.
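For example, with the two dicts from your question (note that the result is a nested defaultdict, so its repr is noisy, but it indexes like a plain dict):

>>> dicts = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 4, 'e': 3}}}]
>>> result = sum_three_deep_dict_values(dicts)
>>> result['a']['b']['c'], result['a']['b']['e']
(5, 3)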
If you're still worried about speed after testing the above, you may need to look into other data structures or third-party libraries. But before you do that, try PyPy -- I find it often gives at least a 4x speedup on vanilla for loops.
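As one example of such an alternative data structure: if you can live with flat (k1, k2, k3) tuple keys in the output instead of nesting, a single Counter does the whole job. This is just a sketch, not something I've benchmarked:

from collections import Counter

def sum_three_deep_flat(dicts):
    '''Outer-join and sum as above, but keyed on (k1, k2, k3) tuples.'''
    combined = Counter()
    for d in dicts:
        for k1, layer1 in d.iteritems():
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    # one flat hash lookup per leaf instead of three nested ones
                    combined[k1, k2, k3] += count
    return combined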
Also, do test this against your original code; I think my reasoning above holds, but it's still a bit conjectural. I'm curious about others' suggestions too. At the scale you're working at, this could be a challenge! (Out of curiosity, how long does your current code take?)
UPDATE: I tested this and it is indeed faster, although only by a hair:
>>> %timeit sum_three_deep_original(dicts)
1000 loops, best of 3: 1.38 ms per loop
>>> %timeit sum_three_deep_edited(dicts)
1000 loops, best of 3: 1.26 ms per loop
I'm guessing you need more speed than that for your application. I tried it with PyPy, and I also compiled it using cython (but without any modifications or type annotations). PyPy wins with a 66% speedup. Plain Python again (with slightly different parameters this time):
:~ $ python -c 'from tdsum import test; test()'
1.63905096054
Compiled with cython:
:~ $ python -c 'from tdsum import test; test()'
1.224848032
And using PyPy:
:~ $ pypy -c 'from tdsum import test; test()'
0.427165031433
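For reference, a tdsum.py along these lines would reproduce this kind of run; the synthetic input and repeat count here are illustrative, not the exact data behind the numbers above:

# tdsum.py -- illustrative harness; input size and repeat count are assumptions
import time
from collections import defaultdict

def sum_three_deep_dict_values(dicts):
    combined = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for layer0 in dicts:
        for k1, layer1 in layer0.iteritems():
            for k2, layer2 in layer1.iteritems():
                for k3, count in layer2.iteritems():
                    combined[k1][k2][k3] += count
    return combined

def test():
    # Two synthetic 20x20x20 dicts, so the triple loop dominates the timing.
    dicts = [
        {i: {j: {k: 1 for k in range(20)} for j in range(20)}
         for i in range(20)}
        for _ in range(2)
    ]
    start = time.time()
    for _ in range(100):
        sum_three_deep_dict_values(dicts)
    print time.time() - start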
I would expect a real cython version using a custom-built data structure to outperform PyPy significantly. The problem is that you can't use dicts and still get the iteration speedup you'd like, because cython has to muck about with Python object overhead. So you'd have to implement your own hash table!
I've often wondered why cython doesn't provide a solution to this problem; perhaps there's a numpy type out there that would be usable. I'll keep looking!