I have a dataframe with a column which has dictionaries in it. I want to count the number of occurrences of a dictionary key in the whole column.
One way of doing it is below:
import pandas as pd
from collections import Counter
df = pd.DataFrame({"data": [{"weight": 3, "color": "blue"},
{"size": 5, "weight": 2},{"size": 3, "color": "red"}]})
c = Counter()
for index, row in df.iterrows():
for item in list(row["data"].keys()):
c[item] += 1
print(c)
Which gives
Counter({'weight': 2, 'color': 2, 'size': 2})
Are there faster ways of doing it?
A much faster approach would be to flatten the column with itertools.chain and build a Counter from the result (which will only contain the dictionary keys):
from itertools import chain
Counter(chain.from_iterable(df.data.values.tolist()))
# Counter({'weight': 2, 'color': 2, 'size': 2})
Timings:
def OP(df):
c = Counter()
for index, row in df.iterrows():
for item in list(row["data"].keys()):
c[item] += 1
%timeit OP(df)
# 570 µs ± 49.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit Counter(chain.from_iterable(df.data.values.tolist()))
# 14.2 µs ± 902 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With