I am trying to count the frequency of elements in a column of a pandas DataFrame.
Some toy data:
d = pd.DataFrame({'letters':[['a', 'b', 'c'], np.nan, ['a', 'e', 'd', 'c'], ['a', 'e', 'c']]})
What I can come up with is to loop through the rows and add values to a dictionary:
letter_count = {}
for i in range(len(d)):
if d.iloc[i, ]['letters'] is np.nan:
continue
else:
for letter in d.iloc[i, ]['letters']:
letter_count[letter] = letter_count.get(letter, 0) + 1
This worked for me, except it was not very fast since my dataset was big. I assume by avoiding the explicit for-loop may help, but I cannot come up with a more 'pandasian' way to do this.
Any help is appreciated.
Use chain.from_iterable
to flatten the lists, and then Counter
to count them:
from itertools import chain
from collections import Counter
pd.Series(Counter(chain.from_iterable(d.letters.dropna())))
a 3
b 1
c 3
e 2
d 1
dtype: int64
Or, use value_counts
for the counting step:
pd.Series(list(chain.from_iterable(d.letters.dropna()))).value_counts()
a 3
c 3
e 2
b 1
d 1
dtype: int64
Alternatively, np.unique
, also quite performant:
u, c = np.unique(list(chain.from_iterable(d.letters.dropna())), return_counts=True)
pd.Series(dict(zip(u, c)))
a 3
b 1
c 3
d 1
e 2
dtype: int64
Ok unnesting again
unnesting(d.dropna(),['letters'])['letters'].value_counts()
Out[71]:
a 3
c 3
e 2
d 1
b 1
Name: letters, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With