A generator expression is throwing off a large number of tuple pairs eg. in list form:
pairs = [(3, 47), (6, 47), (9, 47), (6, 27), (11, 27), (23, 27), (41, 27), (4, 67), (9, 67), (11, 67), (33, 67)]
For each pair in pairs, with key = pair[0] and value = pair[1], I want to feed this stream of pairs into a dictionary to cumulatively add the values for the respective keys. The obvious solution is:
dict_k_v = {}
for pair in pairs:
try:
dict_k_v[pair[0]] += pair[1]
except:
dict_k_v[pair[0]] = pair[1]
>>> dict_k_v
{33: 67, 3: 47, 4: 67, 6: 74, 9: 114, 11: 94, 41: 27, 23: 27}
However, could this be achieved with a generator expression or some similar construct that doesn't use a for-loop?
EDIT
To clarify, the generator expression is throwing off a large number of tuple pairs:
(3, 47), (6, 47), (9, 47), (6, 27), (11, 27), (23, 27), (41, 27), (4, 67), (9, 67), (11, 67), (33, 67) ...
and I want to accumulate each key-value pair into a dictionary (see Paul McGuire's answer) as each pair is being generated. The pairs = list[] statement was unnecessary and sorry about that. For each pair (x,y), x is an integer and y can be an integer or decimal/float.
My generator expression is of the form:
((x,y) for y in something() for x in somethingelse())
and want to accumulate each (x,y) pair into a defaultdict. Hth.
For discussion, here is a simple generator function to give us some data:
from random import randint
def generator1():
for i in range(10000):
yield (randint(1,10), randint(1,100))
And here is the basic solution that uses a Python for-loop to consume the generator and tally up counts for each key-value pair
from collections import defaultdict
tally = defaultdict(int)
for k,v in generator1():
tally[k] += v
for k in sorted(tally):
print k, tally[k]
Will print something like:
1 49030
2 51963
3 51396
4 49292
5 51908
6 49481
7 49645
8 49149
9 48523
10 50722
But we can create a coroutine that will accept each key-value pair sent to it, and accumulate them all into a defaultdict passed into it:
# define coroutine to update defaultdict for every
# key,value pair sent to it
def tallyAccumulator(t):
try:
while True:
k,v = (yield)
t[k] += v
except GeneratorExit:
pass
We'll initialize the coroutine with a tally defaultdict, and get it ready to accept values by sending a None value to it:
# init coroutine
tally = defaultdict(int)
c = tallyAccumulator(tally)
c.send(None)
We could use a for loop or a list comprehension to send all of the generator values to the coroutine:
for val in generator1():
c.send(val)
or
[c.send(val) for val in generator1()]
But instead, we'll use a zero-sized deque to process all the generator expression's values without creating an unnecessary temporary list of None's:
# create generator expression consumer
from collections import deque
do_all = deque(maxlen=0).extend
# loop thru generator at C speed, instead of Python for-loop speed
do_all(c.send(val) for val in generator1())
Now we look at the values again:
for k in sorted(tally):
print k, tally[k]
And we get another list similar to the first one:
1 52236
2 49139
3 51848
4 51194
5 51275
6 50012
7 51875
8 46013
9 50955
10 52192
Read more about coroutines at David Beazley's page: http://www.dabeaz.com/coroutines/
You can use tuple destructuring and a defaultdict
to shorten that loop a lot:
from collections import defaultdict
d = defaultdict(int)
for k,v in pairs: d[k] += v
This still uses a for-loop, but you don't have to handle the case where a key hasn't been seen before. I think this is probably the best solution, both readability-wise and performance-wise.
groupby
That said, you could do it using itertools.groupby
, but it's a bit of a hack:
import itertools
dict((k, sum(v for k,v in group)) for k, group
in itertools.groupby(sorted(pairs), lambda (k,v): k))
Also, this should actually be less performant than the first approach, because an in-memory list of all the pairs needs to be created for the sorting.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With