My data has the following shape:
id  column1  column2
a   x        1
a   x        3
a   y        3
b   y        1
b   y        2
I want to get the most repeated value for each id, as well as its frequency percentage:
id  column1  %      column2  %
a   x        66.6   3        66.6
b   y        100.0  N/A      N/A
A special case is when the top frequencies are tied: then I output N/A for both the value and its percentage.
Right now my solution uses only Python dictionaries and lists. However, I am struggling to approach this from a DataFrame point of view.
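For anyone who wants to reproduce the examples, the sample above could be built roughly like this. It is only a sketch: it assumes column1 holds strings and column2 holds integers, which the question does not state explicitly.

```python
import pandas as pd

# Sample data from the question (dtypes are an assumption).
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "column1": ["x", "x", "y", "y", "y"],
    "column2": [1, 3, 3, 1, 2],
})

print(df)
```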
I can only think of a for loop followed by concat:
g = df.groupby('id')
pd.concat(
    [g[x].value_counts(normalize=True)
         .groupby(level=0).head(1)
         .to_frame('%')
         .reset_index(level=1)
     for x in df.columns[1:]],
    axis=1)
Out[135]:
    column1  %         column2  %
id
a   x        0.666667  3        0.666667
b   y        1.000000  1        0.500000
A (very) similar solution to @Wen's, but it accounts for the case where the ratios within a group are tied, in which case the result should be NaN:
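u = df.groupby('id')
c = ('column1', 'column2')

def helper(group, col):
    # Relative frequency of each value within every id, most frequent first
    return (group[col].value_counts(normalize=True, sort=True)
            .drop_duplicates(keep=False)   # discard frequencies that occur more than once
            .groupby(level=0).head(1)      # keep the top remaining value per id
            .to_frame(f'{col}_%')
            .reset_index(level=1))

pd.concat([helper(u, col) for col in c], axis=1)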
   column1  column1_%  column2  column2_%
a  x        0.666667   3.0      0.666667
b  y        1.000000   NaN      NaN
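One thing to watch with the approach above: drop_duplicates runs over the whole Series, not per id, so two different ids that happen to share the same top ratio (say both 0.666667) would also be dropped. Below is a minimal sketch of a variant that checks for ties inside each id only. The function name most_frequent is hypothetical, rounding to one decimal is my assumption about the desired output, and it uses an explicit loop over the groups rather than a fully vectorized chain.

```python
import pandas as pd

# Sample data from the question (dtypes are an assumption).
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "column1": ["x", "x", "y", "y", "y"],
    "column2": [1, 3, 3, 1, 2],
})


def most_frequent(frame, key, cols):
    """Hypothetical helper: top value and its percentage per group, NaN on ties."""
    rows = []
    for group_id, group in frame.groupby(key):
        row = {key: group_id}
        for col in cols:
            counts = group[col].value_counts()  # sorted, most frequent first
            # A tie means the two largest counts within this group are equal.
            tied = len(counts) > 1 and counts.iloc[0] == counts.iloc[1]
            if counts.empty or tied:
                row[col] = float("nan")
                row[f"{col}_%"] = float("nan")
            else:
                row[col] = counts.index[0]
                row[f"{col}_%"] = round(100 * counts.iloc[0] / counts.sum(), 1)
        rows.append(row)
    return pd.DataFrame(rows).set_index(key)


result = most_frequent(df, "id", ["column1", "column2"])
print(result)
```

Because the tie check compares raw counts inside one group, identical percentages in different ids do not interfere with each other. If the literal string N/A is wanted in the output, calling fillna("N/A") on the result is one option, at the cost of turning the affected columns into object dtype.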