Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Generator Expression for Accumulating Dictionary Values

A generator expression is throwing off a large number of tuple pairs eg. in list form:

pairs = [(3, 47), (6, 47), (9, 47), (6, 27), (11, 27), (23, 27), (41, 27), (4, 67), (9, 67), (11, 67), (33, 67)]

For each pair in pairs, with key = pair[0] and value = pair[1], I want to feed this stream of pairs into a dictionary to cumulatively add the values for the respective keys. The obvious solution is:

dict_k_v = {}
for pair in pairs:
    try:
        dict_k_v[pair[0]] += pair[1]
    except:
        dict_k_v[pair[0]] = pair[1]

>>> dict_k_v
{33: 67, 3: 47, 4: 67, 6: 74, 9: 114, 11: 94, 41: 27, 23: 27}

However, could this be achieved with a generator expression or some similar construct that doesn't use a for-loop?

EDIT

To clarify, the generator expression is throwing off a large number of tuple pairs:

(3, 47), (6, 47), (9, 47), (6, 27), (11, 27), (23, 27), (41, 27), (4, 67), (9, 67), (11, 67), (33, 67) ...

and I want to accumulate each key-value pair into a dictionary (see Paul McGuire's answer) as each pair is being generated. The pairs = list[] statement was unnecessary and sorry about that. For each pair (x,y), x is an integer and y can be an integer or decimal/float.

My generator expression is of the form:

((x,y) for y in something() for x in somethingelse())

and want to accumulate each (x,y) pair into a defaultdict. Hth.

like image 214
Henry Thornton Avatar asked Feb 14 '12 23:02

Henry Thornton


2 Answers

For discussion, here is a simple generator function to give us some data:

from random import randint
def generator1():
    for i in range(10000):
        yield (randint(1,10), randint(1,100))

And here is the basic solution that uses a Python for-loop to consume the generator and tally up counts for each key-value pair

from collections import defaultdict

tally = defaultdict(int)
for k,v in generator1():
    tally[k] += v

for k in sorted(tally):
    print k, tally[k]

Will print something like:

1 49030
2 51963
3 51396
4 49292
5 51908
6 49481
7 49645
8 49149
9 48523
10 50722

But we can create a coroutine that will accept each key-value pair sent to it, and accumulate them all into a defaultdict passed into it:

# define coroutine to update defaultdict for every
# key,value pair sent to it
def tallyAccumulator(t):
    try:
        while True:
            k,v = (yield)
            t[k] += v
    except GeneratorExit:
        pass

We'll initialize the coroutine with a tally defaultdict, and get it ready to accept values by sending a None value to it:

# init coroutine
tally = defaultdict(int)
c = tallyAccumulator(tally)
c.send(None)

We could use a for loop or a list comprehension to send all of the generator values to the coroutine:

for val in generator1():
    c.send(val)

or

[c.send(val) for val in generator1()]

But instead, we'll use a zero-sized deque to process all the generator expression's values without creating an unnecessary temporary list of None's:

# create generator expression consumer
from collections import deque
do_all = deque(maxlen=0).extend

# loop thru generator at C speed, instead of Python for-loop speed
do_all(c.send(val) for val in generator1())

Now we look at the values again:

for k in sorted(tally):
    print k, tally[k]

And we get another list similar to the first one:

1 52236
2 49139
3 51848
4 51194
5 51275
6 50012
7 51875
8 46013
9 50955
10 52192

Read more about coroutines at David Beazley's page: http://www.dabeaz.com/coroutines/

like image 93
PaulMcG Avatar answered Nov 02 '22 06:11

PaulMcG


You can use tuple destructuring and a defaultdict to shorten that loop a lot:

from collections import defaultdict
d = defaultdict(int)
for k,v in pairs: d[k] += v

This still uses a for-loop, but you don't have to handle the case where a key hasn't been seen before. I think this is probably the best solution, both readability-wise and performance-wise.

Proof of concept using groupby

That said, you could do it using itertools.groupby, but it's a bit of a hack:

import itertools
dict((k, sum(v for k,v in group)) for k, group 
     in itertools.groupby(sorted(pairs), lambda (k,v): k))

Also, this should actually be less performant than the first approach, because an in-memory list of all the pairs needs to be created for the sorting.

like image 38
Niklas B. Avatar answered Nov 02 '22 06:11

Niklas B.