
Count frequencies of values in different columns of a dataframe

Tags: python, pandas

My data has the following shape:

id   column1   column2
a    x         1
a    x         3
a    y         3
b    y         1
b    y         2

I want to get the most frequent value of each column for each id, along with its frequency as a percentage:

id   column1  %      column2  %
a    x        66.6   3        66.6
b    y        100.0  N/A      N/A

A special case is a tie for the most frequent value within an id: there I output N/A for both the value and the percentage.

Right now my solution uses only plain Python dictionaries and lists, and I am struggling to approach this from a DataFrame point of view.
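For anyone who wants to reproduce this, the sample data shown above can be built roughly as follows. This is only a sketch: the variable name `df` is an assumption, chosen because the answers refer to a DataFrame by that name.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed variable name.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "column1": ["x", "x", "y", "y", "y"],
    "column2": [1, 3, 3, 1, 2],
})

print(df)
```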

ooo asked Jan 02 '23 07:01


2 Answers

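I can only think of a for loop over the columns followed by a concat. Note that this does not handle the tie case: for id b, column2 still reports one of the tied values.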

g = df.groupby('id')
# Top normalized count per id for each value column, concatenated side by side
pd.concat([g[x].value_counts(normalize=True).groupby(level=0).head(1)
               .to_frame('%').reset_index(level=1)
           for x in df.columns[1:]], axis=1)
Out[135]: 
   column1         %  column2         %
id                                     
a        x  0.666667        3  0.666667
b        y  1.000000        1  0.500000

BENY answered Jan 03 '23 20:01


A (very) similar solution to @Wen's, but it also accounts for the case where the frequencies within a group are tied, in which case the result should be NaN:

u = df.groupby('id')
c = ('column1', 'column2')

def helper(group, col):
    return (group[col].value_counts(normalize=True, sort=True)
            .drop_duplicates(keep=False)
            .groupby(level=0).head(1)
            .to_frame(f'{col}_%')
            .reset_index(level=1))

pd.concat([helper(u, col) for col in c], axis=1)

  column1  column1_%  column2  column2_%
a       x   0.666667      3.0   0.666667
b       y   1.000000      NaN        NaN
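
One caveat with the approach above: `drop_duplicates(keep=False)` runs over the proportions of all ids at once, so two different ids that happen to share a proportion would both be dropped, and after a tie for first place `head(1)` could return a lower-ranked value instead of NaN. The following is a minimal sketch of a per-group variant that compares raw counts instead. It is not taken from either answer: the helper names `top_value` and `summarize` are hypothetical, and scaling to percentages is an assumption based on the output format in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "column1": ["x", "x", "y", "y", "y"],
    "column2": [1, 3, 3, 1, 2],
})

def top_value(values):
    """Return (most frequent value, share in percent), or (NaN, NaN) on a tie for first place."""
    counts = values.value_counts()  # sorted by count, descending
    if counts.empty or (len(counts) > 1 and counts.iloc[0] == counts.iloc[1]):
        return np.nan, np.nan
    return counts.index[0], counts.iloc[0] / counts.sum() * 100

def summarize(frame, key, cols):
    """Build one row per group holding the top value and percentage of each column."""
    keys, rows = [], []
    for group_key, group in frame.groupby(key):
        row = {}
        for col in cols:
            row[col], row[f"{col}_%"] = top_value(group[col])
        keys.append(group_key)
        rows.append(row)
    return pd.DataFrame(rows, index=pd.Index(keys, name=key))

result = summarize(df, "id", ["column1", "column2"])
print(result.round(1))
```

Comparing integer counts rather than normalized proportions avoids floating-point equality checks, and the explicit loop over groups keeps the tie test local to each id.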

user3483203 answered Jan 03 '23 20:01