How do I count the values from a pandas column which is a list of strings?

Tags: python, pandas

I have a dataframe column which is a list of strings:

df['colors']

0              ['blue','green','brown']
1              []
2              ['green','red','blue']
3              ['purple']
4              ['brown']

What I'm trying to get is:

'blue' 2
'green' 2
'brown' 2
'red' 1
'purple' 1
[] 1

Without knowing what I'm doing, I even managed to count the characters in the entire column:

b 5
[ 5
] 5 

etc.

which I think was pretty cool, but the solution to the actual problem escapes me.

vaeinoe asked Jan 25 '23 22:01


2 Answers

Use Counter + chain.from_iterable, which are meant for exactly this. Then construct the Series from the Counter object.

import pandas as pd
from collections import Counter
from itertools import chain

s = pd.Series([['blue','green','brown'], [], ['green','red','blue']])

pd.Series(Counter(chain.from_iterable(s)))
#blue     2
#green    2
#brown    1
#red      1
#dtype: int64
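The question also asks for a count of the empty lists, which chain.from_iterable silently skips. A minimal sketch of one way to add it, assuming the column really holds Python lists; the label '[]' and the names n_empty and result are illustrative choices, not part of the original answer:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Same data as in the question.
s = pd.Series([['blue', 'green', 'brown'], [], ['green', 'red', 'blue'],
               ['purple'], ['brown']])

# Tally every item across all the lists.
counts = Counter(chain.from_iterable(s))

# Empty lists contribute nothing to the chain, so count them separately
# and store them under an arbitrary (hypothetical) label.
n_empty = sum(1 for row in s if len(row) == 0)
if n_empty:
    counts['[]'] = n_empty

result = pd.Series(counts).sort_values(ascending=False)
print(result)
```

Here '[]' is just a string key, so it cannot be told apart from a real item that happens to be spelled '[]'; pick a different label if that matters.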

While explode + value_counts is the idiomatic pandas way to do this, it is slower for shorter lists. The following perfplot benchmark compares the two approaches:

import perfplot
import pandas as pd
import numpy as np

from collections import Counter
from itertools import chain

def counter(s):
    return pd.Series(Counter(chain.from_iterable(s)))

def explode(s):
    return s.explode().value_counts()

perfplot.show(
    setup=lambda n: pd.Series([['blue','green','brown'], [], ['green','red','blue']]*n), 
    kernels=[
        lambda s: counter(s),
        lambda s: explode(s),
    ],
    labels=['counter', 'explode'],
    n_range=[2 ** k for k in range(17)],
    equality_check=np.allclose,  
    xlabel='~len(s)'
)

[Benchmark plot: runtime of counter vs. explode against ~len(s)]
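If perfplot is not installed, a rough version of the same comparison can be sketched with the standard-library timeit module. The sizes, repeat counts, and the names base and timings are arbitrary assumptions, and the numbers will vary by machine and pandas version:

```python
import timeit
from collections import Counter
from itertools import chain

import pandas as pd


def counter(s):
    # Flatten the lists lazily and tally the items in pure Python.
    return pd.Series(Counter(chain.from_iterable(s)))


def explode(s):
    # One row per list item, then count; empty lists become NaN,
    # which value_counts drops by default.
    return s.explode().value_counts()


base = [['blue', 'green', 'brown'], [], ['green', 'red', 'blue']]
timings = {}
for n in (1, 100, 10_000):
    s = pd.Series(base * n)
    # Both approaches should agree on the counts, regardless of row order.
    assert counter(s).to_dict() == explode(s).to_dict()
    timings[n] = {
        f.__name__: min(timeit.repeat(lambda: f(s), number=5, repeat=3))
        for f in (counter, explode)
    }

for n, t in timings.items():
    print(f"len(s)={3 * n:>6}  counter={t['counter']:.5f}s  explode={t['explode']:.5f}s")
```

Taking the minimum over a few repeats reduces noise from other processes, but this is only a quick sanity check, not a substitute for the full benchmark.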

ALollz answered Jan 29 '23 22:01


Solution

Best option: df.colors.explode().dropna().value_counts().

However, if you also want counts for empty lists ([]), use Method-1.B or 1.C, similar to what Quang Hoang suggested in the comments.

You can use either of the following two methods.

  • Method-1: Use pandas methods alone ⭐⭐⭐

    explode --> dropna --> value_counts

  • Method-2: Use list.extend --> pd.Series.value_counts

## Method-1
# A. If you don't want counts for empty []
df.colors.explode().dropna().value_counts()

# B. If you want counts for empty [] (classified as NaN)
df.colors.explode().value_counts(dropna=False) # returns [] as NaN

# C. If you want counts for empty [] (classified as the string '[]')
df.colors.explode().fillna('[]').value_counts() # returns [] as '[]'

## Method-2
colors = []
for e in df.colors:
    colors.extend(e)  # extending with an empty list is a no-op
pd.Series(colors).value_counts()

Output:

green     2
blue      2
brown     2
red       1
purple    1
# NaN     1  ## For Method-1.B
# []      1  ## For Method-1.C
dtype: int64

Dummy Data

import pandas as pd

df = pd.DataFrame({'colors':[['blue','green','brown'],
                             [],
                             ['green','red','blue'],
                             ['purple'],
                             ['brown']]})

CypherX answered Jan 29 '23 21:01