I have a problem with counting distinct values for each key in Python.
I have a list of dictionaries d like
[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
I need to print the number of distinct values for each key individually.
That means I want to print
abc 3
xyz 1
pqr 4
Please help.
Thank you
In this approach, the unique values are extracted using set(), len() is used to count them, and the result is mapped to each key obtained from keys().
In dictionaries, items are stored as key-value pairs, which means the number of items equals the number of keys. Therefore, when len() is applied to a dictionary, it returns the total number of keys.
If you want to count the occurrences of each value in a Python dictionary, you can pass the dictionary's values to collections.Counter(). It returns the number of times each value occurs in the dictionary.
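As a small illustration of the two points above, here is a sketch using a made-up flat dictionary (the name `prefs` and its contents are assumptions for the example, not data from the question):

```python
from collections import Counter

# Hypothetical flat dictionary, used only for illustration.
prefs = {"abc": "music", "xyz": "music", "pqr": "news"}

# len() on a dictionary returns the number of keys.
num_keys = len(prefs)

# A Counter built from the values tallies how often each value occurs.
value_counts = Counter(prefs.values())

print(num_keys)
print(value_counts["music"], value_counts["news"])
```

Note that this counts values across one dictionary; it does not by itself count distinct values per key in a list of dictionaries.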
Over 6 years after answering, someone pointed out to me I misread the question. While my original answer (below) counts unique keys in the input sequence, you actually have a different count-distinct problem; you want to count values per key.
To count unique values per key, exactly, you'd have to collect those values into sets first:
values_per_key = {}
for d in iterable_of_dicts:
    for k, v in d.items():
        values_per_key.setdefault(k, set()).add(v)
counts = {k: len(v) for k, v in values_per_key.items()}
which for your input, produces:
>>> values_per_key = {}
>>> for d in iterable_of_dicts:
...     for k, v in d.items():
...         values_per_key.setdefault(k, set()).add(v)
...
>>> counts = {k: len(v) for k, v in values_per_key.items()}
>>> counts
{'abc': 3, 'xyz': 1, 'pqr': 4}
You can still wrap that object in a Counter() instance if you want to make use of the additional functionality this class offers (see below):
>>> from collections import Counter
>>> Counter(counts)
Counter({'pqr': 4, 'abc': 3, 'xyz': 1})
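For reference, the same set-based idea can be written as a self-contained function. This is only a sketch: the function name `count_distinct_values` is made up for the example, and it uses `defaultdict(set)` in place of `setdefault()`, which behaves the same way here. It assumes the values are hashable (strings, in the question).

```python
from collections import defaultdict


def count_distinct_values(iterable_of_dicts):
    """Return {key: number of distinct values seen for that key}."""
    values_per_key = defaultdict(set)
    for d in iterable_of_dicts:
        for k, v in d.items():
            # A set ignores values it has already seen for this key.
            values_per_key[k].add(v)
    return {k: len(v) for k, v in values_per_key.items()}


# The input list from the question.
data = [
    {"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
    {"xyz": "music"},
    {"pqr": "music"}, {"pqr": "movies"}, {"pqr": "sports"},
    {"pqr": "news"}, {"pqr": "sports"},
]

# Print one "key count" line per key, as asked for in the question.
for key, count in count_distinct_values(data).items():
    print(key, count)
```

The duplicate "sports" entry for pqr is stored only once in its set, which is why pqr is reported with 4 distinct values rather than 5.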
The downside is that if your input iterable is very large, the above approach can require a lot of memory. If you don't need exact counts, e.g. when orders of magnitude suffice, there are other approaches, such as a HyperLogLog structure or other algorithms that 'sketch out' a count for the stream.
This approach requires you to install a 3rd-party library. As an example, the datasketch project offers both HyperLogLog and MinHash. Here's an HLL example (using the HyperLogLogPlusPlus class, which is a recent improvement to the HLL approach):
from collections import defaultdict
from datasketch import HyperLogLogPlusPlus
counts = defaultdict(HyperLogLogPlusPlus)
for d in iterable_of_dicts:
    for k, v in d.items():
        counts[k].update(v.encode('utf8'))
In a distributed setup, you could use Redis to manage the HLL counts.
My original answer:
Use a collections.Counter() instance, together with some chaining:
from collections import Counter
from itertools import chain
counts = Counter(chain.from_iterable(e.keys() for e in d))
This ensures that dictionaries with more than one key in your input list are counted correctly.
Demo:
>>> from collections import Counter
>>> from itertools import chain
>>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
or with multiple keys in the input dictionaries:
>>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
A Counter() has additional, helpful functionality, such as the .most_common() method, which lists elements and their counts from most common to least common:
for key, count in counts.most_common():
    print('{}: {}'.format(key, count))
# prints
# pqr: 5
# abc: 3
# xyz: 1
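There is no need to use a Counter to count how often each key occurs. You can do it this way (note that this counts occurrences of each key, so pqr gives 5, not the number of distinct values):
# input: a list of dictionaries
d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
# collect the key of every item in every dictionary
b = [key for i in d for key in i]
# print each distinct key with its number of occurrences
for k in set(b):
    print("{0}: {1}".format(k, b.count(k)))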