
Pandas normalise by column on groupby

Given a pandas dataframe such as

import pandas as pd

df = pd.DataFrame({'id': ['id1','id1','id2','id2'] , 
                   'x':  [1,2,3,4], 
                   'y':  [10,20,30,40]})

each numerical column may be normalised to the unit interval [0,1] with

columns = ['x', 'y']

for column in columns:
    df[column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())

resulting in

    id         x         y
0  id1  0.000000  0.000000
1  id1  0.333333  0.333333
2  id2  0.666667  0.666667
3  id2  1.000000  1.000000

However, how can this normalisation be applied to each numerical column separately for each id? For this oversimplified example, the expected outcome would be

    id         x         y
0  id1  0.000000  0.000000
1  id1  1.000000  1.000000
2  id2  0.000000  0.000000
3  id2  1.000000  1.000000

It proves unclear how to update each normalised column after

df.groupby(['id']).apply(lambda x: ...)
iris asked Mar 29 '21 11:03


2 Answers

Use GroupBy.transform:

columns = ['x', 'y']
g = df.groupby('id')[columns]
df[columns] = (df[columns] - g.transform('min')) / (g.transform('max') - g.transform('min'))
    
print(df)
    id    x    y
0  id1  0.0  0.0
1  id1  1.0  1.0
2  id2  0.0  0.0
3  id2  1.0  1.0
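
This works because transform returns a result with the same index as df, where each row holds the min (or max) of its own group, so the subtraction and division line up row by row. A minimal, self-contained sketch of the same idea follows; the helper name normalise_by_group is hypothetical (it is not part of pandas or of the answer above), and it returns a copy rather than modifying df in place:

```python
import pandas as pd


def normalise_by_group(df, key, columns):
    """Hypothetical helper: min-max scale `columns` within each `key` group."""
    g = df.groupby(key)[columns]
    # transform broadcasts each group's min/max back onto the original rows
    group_min = g.transform("min")
    group_max = g.transform("max")
    out = df.copy()
    # note: a column that is constant within a group gives 0/0 here, i.e. NaN
    out[columns] = (df[columns] - group_min) / (group_max - group_min)
    return out


df = pd.DataFrame({"id": ["id1", "id1", "id2", "id2"],
                   "x": [1, 2, 3, 4],
                   "y": [10, 20, 30, 40]})

result = normalise_by_group(df, "id", ["x", "y"])
print(result)
```

If constant groups can occur in real data, the NaN values would need separate handling (for example with fillna); what to fill them with depends on the use case.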
jezrael answered Oct 20 '22 15:10


It proves unclear how to update each normalised column after df.groupby(['id']).apply(lambda x: ...)

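You can call apply again inside the group-wise apply and assign the result back:

# group_keys=False keeps the original row index so the assignment aligns
df[columns] = df.groupby("id", group_keys=False)[columns].apply(
    lambda id_df: id_df.apply(
        lambda serie: (serie - serie.min()) / (serie.max() - serie.min())))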
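
As a side note, the same per-column function can also be handed to GroupBy.transform, which avoids the nested apply. The sketch below compares the two on the example frame; the function name min_max is an assumption made for illustration, and group_keys=False is passed so that the apply result keeps the original row index:

```python
import pandas as pd

df = pd.DataFrame({"id": ["id1", "id1", "id2", "id2"],
                   "x": [1, 2, 3, 4],
                   "y": [10, 20, 30, 40]})
columns = ["x", "y"]


def min_max(serie):
    # hypothetical helper: scale values to the unit interval [0, 1]
    return (serie - serie.min()) / (serie.max() - serie.min())


# nested apply: outer apply runs per id, inner apply runs per column
via_apply = (
    df.groupby("id", group_keys=False)[columns]
      .apply(lambda id_df: id_df.apply(min_max))
)

# transform with a callable: applied within each group, index preserved
via_transform = df.groupby("id")[columns].transform(min_max)

print(via_apply)
print(via_transform)
```

Either result can then be written back with df[columns] = ..., since both are indexed like the original frame.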
Mustafa Aydın answered Oct 20 '22 14:10