Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to get frequency of elements in a pandas column of lists

Tags:

python

pandas

I am trying to count the frequency of elements in a column of a pandas DataFrame.

Some toy data:

d = pd.DataFrame({'letters':[['a', 'b', 'c'], np.nan, ['a', 'e', 'd', 'c'], ['a', 'e', 'c']]})

What I can come up with is to loop through the rows and add values to a dictionary:

letter_count = {}
for i in range(len(d)):
    if d.iloc[i, ]['letters'] is np.nan:
        continue
    else:
        for letter in d.iloc[i, ]['letters']:
            letter_count[letter] = letter_count.get(letter, 0) + 1

This worked for me, except it was not very fast since my dataset was big. I assume by avoiding the explicit for-loop may help, but I cannot come up with a more 'pandasian' way to do this.

Any help is appreciated.

like image 465
Xiaoyu Lu Avatar asked Dec 13 '22 13:12

Xiaoyu Lu


2 Answers

Use chain.from_iterable to flatten the lists, and then Counter to count them:

from itertools import chain
from collections import Counter

pd.Series(Counter(chain.from_iterable(d.letters.dropna())))

a    3
b    1
c    3
e    2
d    1
dtype: int64

Or, use value_counts for the counting step:

pd.Series(list(chain.from_iterable(d.letters.dropna()))).value_counts()

a    3
c    3
e    2
b    1
d    1
dtype: int64

Alternatively, np.unique, also quite performant:

u, c = np.unique(list(chain.from_iterable(d.letters.dropna())), return_counts=True)

pd.Series(dict(zip(u, c)))

a    3
b    1
c    3
d    1
e    2
dtype: int64
like image 130
cs95 Avatar answered Dec 16 '22 03:12

cs95


Ok unnesting again

unnesting(d.dropna(),['letters'])['letters'].value_counts()
Out[71]: 
a    3
c    3
e    2
d    1
b    1
Name: letters, dtype: int64
like image 22
BENY Avatar answered Dec 16 '22 03:12

BENY