python dictionary count of unique values

I have a problem with counting distinct values for each key in Python.

I have a list of dictionaries d like:

[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

I need to print the number of distinct values for each key.

That is, I want to print:

abc 3
xyz 1
pqr 4

Please help.

Thank you

asked May 06 '13 20:05 by user1189851



2 Answers

Over 6 years after answering, someone pointed out to me I misread the question. While my original answer (below) counts unique keys in the input sequence, you actually have a different count-distinct problem; you want to count values per key.

To count unique values per key exactly, you have to collect those values into sets first:

values_per_key = {}
for d in iterable_of_dicts:
    for k, v in d.items():
        values_per_key.setdefault(k, set()).add(v)
counts = {k: len(v) for k, v in values_per_key.items()}

which, for your input (with iterable_of_dicts set to your list d), produces:

>>> values_per_key = {}
>>> for d in iterable_of_dicts:
...     for k, v in d.items():
...         values_per_key.setdefault(k, set()).add(v)
...
>>> counts = {k: len(v) for k, v in values_per_key.items()}
>>> counts
{'abc': 3, 'xyz': 1, 'pqr': 4}
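The same idea can be packaged as a small reusable function. This is only a minimal sketch, assuming the values are hashable; the function name count_distinct_values is hypothetical, and defaultdict(set) is swapped in for the setdefault() call used above:

```python
from collections import defaultdict


def count_distinct_values(iterable_of_dicts):
    """Return a dict mapping each key to its number of distinct values."""
    values_per_key = defaultdict(set)
    for d in iterable_of_dicts:
        for k, v in d.items():
            # a set silently ignores values it has already seen
            values_per_key[k].add(v)
    return {k: len(v) for k, v in values_per_key.items()}


data = [{"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
        {"xyz": "music"}, {"pqr": "music"}, {"pqr": "movies"},
        {"pqr": "sports"}, {"pqr": "news"}, {"pqr": "sports"}]

result = count_distinct_values(data)
for key, n in result.items():
    print(key, n)
# abc 3
# xyz 1
# pqr 4
```

The output order relies on dictionaries preserving insertion order, which holds on Python 3.7 and later.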

You can still wrap that object in a Counter() instance if you want to make use of the additional functionality this class offers (see below):

>>> from collections import Counter
>>> Counter(counts)
Counter({'pqr': 4, 'abc': 3, 'xyz': 1})

The downside is that if your input iterable is very large, the above approach can require a lot of memory. If you don't need exact counts, e.g. when orders of magnitude suffice, there are other approaches, such as a HyperLogLog structure or other algorithms that 'sketch out' a count for the stream.

These approaches require installing a third-party library. As an example, the datasketch project offers both HyperLogLog and MinHash. Here's an HLL example (using the HyperLogLogPlusPlus class, which is an improvement on the original HLL approach):

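from collections import defaultdict
from datasketch import HyperLogLogPlusPlus

# one HLL++ sketch per key, created on first use
counts = defaultdict(HyperLogLogPlusPlus)

for d in iterable_of_dicts:
    for k, v in d.items():
        # update() expects bytes, so encode the string value
        counts[k].update(v.encode('utf8'))

# approximate number of distinct values per key
estimates = {k: hll.count() for k, hll in counts.items()}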

In a distributed setup, you could use Redis to manage the HLL counts.


My original answer:

Use a collections.Counter() instance, together with some chaining:

from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(e.keys() for e in d))

This ensures that dictionaries with more than one key in your input list are counted correctly.

Demo:

>>> from collections import Counter
>>> from itertools import chain
>>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

or with multiple keys in the input dictionaries:

>>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

A Counter() has additional, helpful functionality, such as the .most_common() method, which lists elements and their counts from most to least common:

for key, count in counts.most_common():
    print('{}: {}'.format(key, count))

# prints
# pqr: 5
# abc: 3
# xyz: 1
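
As a small illustration of that extra functionality, .most_common() also accepts an optional number n to limit the result to the n most frequent entries. A minimal sketch using the question's data (the variable name top_two is just for illustration):

```python
from collections import Counter
from itertools import chain

d = [{"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
     {"xyz": "music"}, {"pqr": "music"}, {"pqr": "movies"},
     {"pqr": "sports"}, {"pqr": "news"}, {"pqr": "sports"}]

# count how often each key occurs across all dictionaries
counts = Counter(chain.from_iterable(e.keys() for e in d))

# keep only the two most frequent keys
top_two = counts.most_common(2)
print(top_two)
# [('pqr', 5), ('abc', 3)]
```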
answered Sep 24 '22 15:09 by Martijn Pieters


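There is no need to use Counter. You can do it this way (note that this counts how often each key appears in the list, so pqr gives 5, rather than the number of distinct values):

# input: a list of dictionaries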
d=[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

# collect the key of every entry in every dictionary
b = [k for i in d for k in i]

# print each key with the number of times it occurs
for k in set(b):
    print("{0}: {1}".format(k, b.count(k)))
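
If the number of distinct values per key is wanted instead, the same list-based idea can be adapted by de-duplicating the (key, value) pairs before counting. This is a sketch rather than part of the original answer, and it assumes the values are hashable; the names pairs and distinct are illustrative:

```python
d = [{"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
     {"xyz": "music"}, {"pqr": "music"}, {"pqr": "movies"},
     {"pqr": "sports"}, {"pqr": "news"}, {"pqr": "sports"}]

# a set of (key, value) pairs drops repeated values for the same key
pairs = {(k, v) for i in d for k, v in i.items()}

# keep only the key of each distinct pair
b = [k for k, _ in pairs]

# each key now appears once per distinct value
distinct = {k: b.count(k) for k in set(b)}

for k in sorted(distinct):
    print("{0}: {1}".format(k, distinct[k]))
# abc: 3
# pqr: 4
# xyz: 1
```

Note that b.count(k) rescans the list for every key, so this stays simple but is slower than the set-per-key approach for large inputs.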
answered Sep 23 '22 15:09 by akashdeep