 

Python: Elegantly merge dictionaries with sum() of values [duplicate]

I'm trying to merge logs from several servers. Each log is a list of tuples (date, count). date may appear more than once, and I want the resulting dictionary to hold the sum of all counts from all servers.

Here's my attempt, with some example data:

    from collections import defaultdict

    a = [("13.5", 100)]
    b = [("14.5", 100), ("15.5", 100)]
    c = [("15.5", 100), ("16.5", 100)]
    input = [a, b, c]

    output = defaultdict(int)   # missing dates default to 0
    for d in input:
        for item in d:
            output[item[0]] += item[1]
    print(dict(output))

Which gives:

{'14.5': 100, '16.5': 100, '13.5': 100, '15.5': 200} 

As expected.

I'm about to go bananas because of a colleague who saw the code. She insists that there must be a more Pythonic and elegant way to do it, without these nested for loops. Any ideas?

asked Jul 02 '12 by Adam Matan

2 Answers

Doesn't get simpler than this, I think:

a=[("13.5",100)] b=[("14.5",100), ("15.5", 100)] c=[("15.5",100), ("16.5", 100)] input=[a,b,c]  from collections import Counter  print sum(     (Counter(dict(x)) for x in input),     Counter()) 

Note that Counter (also known as a multiset) is the most natural data structure for your data: a kind of set whose elements can occur more than once, or equivalently a map with the semantics Element -> OccurrenceCount. You could have used it in the first place, instead of lists of tuples.
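For illustration, a minimal sketch (the per-server variable names here are made up, not from the question) of what that would look like if each server's log were collected straight into a Counter, making the merge a plain addition:

    from collections import Counter

    # hypothetical per-server logs, kept as Counters from the start
    server_a = Counter({"13.5": 100})
    server_b = Counter({"14.5": 100, "15.5": 100})
    server_c = Counter({"15.5": 100, "16.5": 100})

    merged = server_a + server_b + server_c   # counts for shared dates are summed
    print(merged)                             # '15.5' ends up with 200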


Also possible:

    from collections import Counter
    from functools import reduce   # needed on Python 3, where reduce is no longer a builtin
    from operator import add

    print(reduce(add, (Counter(dict(x)) for x in input)))

Using reduce(add, seq) instead of sum(seq, initialValue) is generally more flexible and allows you to skip passing the redundant initial value.

Note that you could also use operator.and_ to find the intersection of the multisets instead of the sum.
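As a quick aside (a sketch on two of the example logs from the question, not part of the original answer), the intersection keeps only the dates present in both Counters, with the smaller of the two counts:

    from collections import Counter

    b = Counter(dict([("14.5", 100), ("15.5", 100)]))
    c = Counter(dict([("15.5", 100), ("16.5", 100)]))

    # & calls Counter.__and__ (what operator.and_ dispatches to): per-key minimum
    print(b & c)   # Counter({'15.5': 100})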


These sum/reduce variants are terribly slow, though, because a new Counter is created on every step. Let's fix that.

We know that Counter+Counter returns a new Counter with merged data. This is OK, but we want to avoid extra creation. Let's use Counter.update instead:

    update(self, iterable=None, **kwds) unbound collections.Counter method
        Like dict.update() but add counts instead of replacing them.
        Source can be an iterable, a dictionary, or another Counter instance.
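To make the difference concrete, a small sketch (not from the original answer): dict.update replaces the value, while Counter.update adds to it:

    from collections import Counter

    d = {"15.5": 100}
    d.update({"15.5": 100})    # plain dict: the value is overwritten
    print(d)                   # {'15.5': 100}

    c = Counter({"15.5": 100})
    c.update({"15.5": 100})    # Counter: the counts are added in place
    print(c)                   # Counter({'15.5': 200})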

That's what we want. Let's wrap it with a function compatible with reduce and see what happens.

    from functools import reduce   # needed on Python 3

    def updateInPlace(a, b):
        a.update(b)      # merge b's counts into a, in place
        return a

    print(reduce(updateInPlace, (Counter(dict(x)) for x in input)))

This is only marginally slower than the OP's solution.

Benchmark: http://ideone.com/7IzSx (Updated with yet another solution, thanks to astynax)
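If you want to reproduce a rough comparison locally rather than on ideone, a minimal timeit sketch along these lines would do (the data duplication factor and the repetition count are arbitrary choices here, not taken from the original benchmark):

    import timeit

    setup = """
    from collections import Counter
    from functools import reduce

    input = [[("13.5", 100)],
             [("14.5", 100), ("15.5", 100)],
             [("15.5", 100), ("16.5", 100)]] * 1000

    def updateInPlace(a, b):
        a.update(b)
        return a
    """

    print(timeit.timeit("sum((Counter(dict(x)) for x in input), Counter())",
                        setup=setup, number=10))
    print(timeit.timeit("reduce(updateInPlace, (Counter(dict(x)) for x in input))",
                        setup=setup, number=10))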

(Also: if you desperately want a one-liner, you can replace updateInPlace with lambda x,y: x.update(y) or x, which works the same way and even proves to be a split second faster, but fails at readability. Don't :-))

answered Sep 21 '22 by Kos


    from collections import Counter

    a = [("13.5", 100)]
    b = [("14.5", 100), ("15.5", 100)]
    c = [("15.5", 100), ("16.5", 100)]

    inp = [dict(x) for x in (a, b, c)]
    count = Counter()
    for y in inp:
        count += Counter(y)
    print(count)

output:

Counter({'15.5': 200, '14.5': 100, '16.5': 100, '13.5': 100}) 
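A Counter is already a dict subclass, so if the plain dict from the question's expected output is what you need, converting the result is a one-liner (continuing from the snippet above; not part of the original answer):

    print(dict(count))   # a regular dict with the same summed counts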

Edit: As Duncan suggested, you can replace these three lines:

    count = Counter()
    for y in inp:
        count += Counter(y)

with the single line:

    count = sum((Counter(y) for y in inp), Counter())

answered Sep 20 '22 by Ashwini Chaudhary