I'm trying to efficiently change:
[{'text': 'hallo world', 'num': 1},
{'text': 'hallo world', 'num': 2},
{'text': 'hallo world', 'num': 1},
{'text': 'haltlo world', 'num': 1},
{'text': 'hallo world', 'num': 1},
{'text': 'hallo world', 'num': 1},
{'text': 'hallo world', 'num': 1}]
into a list of dictionaries without duplicates and a count of duplicates:
[{'text': 'hallo world', 'num': 2, 'count':1},
{'text': 'hallo world', 'num': 1, 'count':5},
{'text': 'haltlo world', 'num': 1, 'count':1}]
So far, I have the following to remove duplicates (li is the list above):
result = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in li)]
and it returns:
[{'text': 'hallo world', 'num': 2},
{'text': 'hallo world', 'num': 1},
{'text': 'haltlo world', 'num': 1}]
THANKS!
I'll use one of my favourites from itertools:
from itertools import groupby

def canonicalize_dict(x):
    "Return a (key, value) list sorted by the hash of the key"
    return sorted(x.items(), key=lambda pair: hash(pair[0]))

def unique_and_count(lst):
    "Return a list of unique dicts with a 'count' key added"
    grouper = groupby(sorted(map(canonicalize_dict, lst)))
    return [dict(k + [("count", len(list(g)))]) for k, g in grouper]

a = [{'text': 'hallo world', 'num': 1},
     #....
     {'text': 'hallo world', 'num': 1}]

print(unique_and_count(a))
Output:
[{'count': 5, 'text': 'hallo world', 'num': 1},
{'count': 1, 'text': 'hallo world', 'num': 2},
{'count': 1, 'text': 'haltlo world', 'num': 1}]
As gnibbler points out, d1.items() and d2.items() may have different key ordering even when the keys are identical, so I've introduced the canonicalize_dict function to address this concern.
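To illustrate the ordering concern, here is a minimal sketch, assuming Python 3.7+ (where dicts keep insertion order) and keys that can be compared with each other, such as all strings. The helper name canonical_items is made up for this example and sorts by the key itself rather than by its hash:

```python
# Two dicts with the same content but different insertion order.
d1 = {'text': 'hallo world', 'num': 1}
d2 = {'num': 1, 'text': 'hallo world'}

# The dicts compare equal, but their item tuples differ
# because items() follows insertion order on Python 3.7+.
same_dicts = d1 == d2
same_tuples = tuple(d1.items()) == tuple(d2.items())

def canonical_items(d):
    """Hypothetical helper: (key, value) pairs sorted by key."""
    return tuple(sorted(d.items()))

# After canonicalizing, both dicts map to the same hashable tuple.
same_canonical = canonical_items(d1) == canonical_items(d2)

print(same_dicts, same_tuples, same_canonical)
```

Because the keys within one dict are unique, the sort never has to compare values, so the values only need to be hashable if the tuple is later put in a set or used as a dict key.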
Note: This now uses frozenset, which means that the values in the dictionaries must be hashable.
>>> from collections import defaultdict
>>> from itertools import chain
>>> data = [{'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 2}, {'text': 'hallo world', 'num': 1}, {'text': 'haltlo world', 'num': 1}, {'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 1}]
>>> c = defaultdict(int)
>>> for d in data:
...     c[frozenset(d.items())] += 1
...
>>> [dict(chain(k, (('count', count),))) for k, count in c.items()]
[{'count': 1, 'text': 'haltlo world', 'num': 1}, {'count': 1, 'text': 'hallo world', 'num': 2}, {'count': 5, 'text': 'hallo world', 'num': 1}]
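The same frozenset idea can probably be written a little more compactly with collections.Counter. This is only a sketch, assuming Python 3.7+ and hashable values; the function name unique_with_counts is invented for the example:

```python
from collections import Counter

def unique_with_counts(dicts):
    """Hypothetical helper: unique dicts, each with a 'count' key added."""
    # frozenset(d.items()) ignores key order, so equal dicts collapse together.
    counts = Counter(frozenset(d.items()) for d in dicts)
    # Rebuild each dict from its (key, value) pairs and attach the count.
    return [dict(items, count=n) for items, n in counts.items()]

data = [{'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 2},
        {'text': 'hallo world', 'num': 1},
        {'text': 'haltlo world', 'num': 1},
        {'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 1}]

result = unique_with_counts(data)
print(result)
```

For the data from the question this gives three dicts with counts 5, 1 and 1, in first-seen order. Note that an existing 'count' key in the input would be overwritten, and a value such as a list would raise TypeError when the frozenset is built.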