
Group by two columns and max value of third in pandas python

I have a dataframe with PERIOD_START_TIME, ID, a few more columns, and a VALUE column. I need to group by PERIOD_START_TIME and ID (because there are duplicate rows for the same time and ID) and take the maximum of VALUE in each group. df:

PERIOD_START_TIME     ID  ...  VALUE
06.01.2017 02:00:00   55  ...   35
06.01.2017 02:00:00   55  ...   22
06.01.2017 03:00:00   55  ...   63
06.01.2017 03:00:00   55  ...   33
06.01.2017 04:00:00   55  ...   63
06.01.2017 04:00:00   55  ...   45
06.01.2017 02:00:00   65  ...   10
06.01.2017 02:00:00   65  ...   5
06.01.2017 03:00:00   65  ...   22
06.01.2017 03:00:00   65  ...   5
06.01.2017 04:00:00   65  ...   12
06.01.2017 04:00:00   65  ...   15

Desired output:

PERIOD_START_TIME     ID  ...  VALUE
06.01.2017 02:00:00   55  ...   35
06.01.2017 03:00:00   55  ...   63
06.01.2017 04:00:00   55  ...   63
06.01.2017 02:00:00   65  ...   10
06.01.2017 03:00:00   65  ...   22
06.01.2017 04:00:00   65  ...   15
jovicbg asked Dec 18 '22 06:12


1 Answer

Use groupby and aggregate max:

print (df)
      PERIOD_START_TIME  ID  A  VALUE
0   06.01.2017 02:00:00  55  8     35
1   06.01.2017 02:00:00  55  8     22
2   06.01.2017 03:00:00  55  8     63
3   06.01.2017 03:00:00  55  8     33
4   06.01.2017 04:00:00  55  8     63
5   06.01.2017 04:00:00  55  8     45
6   06.01.2017 02:00:00  65  8     10
7   06.01.2017 02:00:00  65  8      5
8   06.01.2017 03:00:00  65  8     22
9   06.01.2017 03:00:00  65  8      5
10  06.01.2017 04:00:00  65  8     12
11  06.01.2017 04:00:00  65  8     15

df = df.groupby(['PERIOD_START_TIME','ID'], as_index=False)['VALUE'].max()

Or, equivalently:

df = df.groupby(['PERIOD_START_TIME','ID'])['VALUE'].max().reset_index()

print (df)
     PERIOD_START_TIME  ID  VALUE
0  06.01.2017 02:00:00  55     35
1  06.01.2017 02:00:00  65     10
2  06.01.2017 03:00:00  55     63
3  06.01.2017 03:00:00  65     22
4  06.01.2017 04:00:00  55     63
5  06.01.2017 04:00:00  65     15
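
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies both spellings. The column A stands in for the elided "..." columns, and the names keys, max_a, max_b and result are only illustrative. The final re-sort is an optional extra, added on the assumption that you want the rows in the same order as the desired output in the question (by ID first, then time).

```python
import pandas as pd

# Sample data copied from the question; column A stands in for the elided "..." columns.
times = ["06.01.2017 02:00:00", "06.01.2017 03:00:00", "06.01.2017 04:00:00"]
df = pd.DataFrame({
    "PERIOD_START_TIME": [t for t in times for _ in range(2)] * 2,
    "ID": [55] * 6 + [65] * 6,
    "A": [8] * 12,
    "VALUE": [35, 22, 63, 33, 63, 45, 10, 5, 22, 5, 12, 15],
})

keys = ["PERIOD_START_TIME", "ID"]

# Store the results in new variables so the original df stays available.
max_a = df.groupby(keys, as_index=False)["VALUE"].max()
max_b = df.groupby(keys)["VALUE"].max().reset_index()

# groupby sorts by the keys (time first, then ID); re-sort by ID first
# to match the row order of the desired output in the question.
result = max_a.sort_values(["ID", "PERIOD_START_TIME"]).reset_index(drop=True)
print(result)
```

Note that only the grouping keys and VALUE survive this aggregation; the other columns are dropped.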

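To keep the other columns as well, start from the original df, get the row label of each group's maximum with idxmax, and select those rows with loc:

df = df.loc[df.groupby(['PERIOD_START_TIME','ID'])['VALUE'].idxmax()]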
print (df)
      PERIOD_START_TIME  ID  A  VALUE
0   06.01.2017 02:00:00  55  8     35
6   06.01.2017 02:00:00  65  8     10
2   06.01.2017 03:00:00  55  8     63
8   06.01.2017 03:00:00  65  8     22
4   06.01.2017 04:00:00  55  8     63
11  06.01.2017 04:00:00  65  8     15 
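
One thing to be aware of: idxmax returns a single row label per group (the first occurrence of the maximum), and the loc lookup assumes the index labels are unique. If several rows can tie for the maximum and you want all of them, a possible variant is to compare VALUE against the per-group maximum from transform. This is only a sketch on the question's sample data (column A stands in for the elided columns; the variable names are illustrative). The sample has no ties, so both approaches pick the same six rows and differ only in row order.

```python
import pandas as pd

# Sample data copied from the question; column A stands in for the elided "..." columns.
times = ["06.01.2017 02:00:00", "06.01.2017 03:00:00", "06.01.2017 04:00:00"]
df = pd.DataFrame({
    "PERIOD_START_TIME": [t for t in times for _ in range(2)] * 2,
    "ID": [55] * 6 + [65] * 6,
    "A": [8] * 12,
    "VALUE": [35, 22, 63, 33, 63, 45, 10, 5, 22, 5, 12, 15],
})

keys = ["PERIOD_START_TIME", "ID"]

# idxmax gives one row label per group (the first occurrence of the maximum).
one_per_group = df.loc[df.groupby(keys)["VALUE"].idxmax()]

# transform("max") broadcasts each group's maximum back onto its rows,
# so the comparison keeps every row that reaches the maximum, ties included.
group_max = df.groupby(keys)["VALUE"].transform("max")
all_max_rows = df[df["VALUE"] == group_max]
print(all_max_rows)
```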

Alternative, again starting from the original df: sort by VALUE in descending order so that the first row of each group holds the maximum:

cols = ['PERIOD_START_TIME','ID']
df = df.sort_values('VALUE', ascending=False).groupby(cols, as_index=False).first()
print (df)
     PERIOD_START_TIME  ID  A  VALUE
0  06.01.2017 02:00:00  55  8     35
1  06.01.2017 02:00:00  65  8     10
2  06.01.2017 03:00:00  55  8     63
3  06.01.2017 03:00:00  65  8     22
4  06.01.2017 04:00:00  55  8     63
5  06.01.2017 04:00:00  65  8     15
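
A sort-based approach only yields the maximum if the rows are ordered by VALUE before the first row of each group is taken. As a rough sketch of that idea, the variant below uses drop_duplicates instead of groupby(...).first(), which keeps whole rows rather than the first non-null entry of each column. The sample frame is rebuilt from the question (column A stands in for the elided columns), and the names keys and best are only illustrative.

```python
import pandas as pd

# Sample data copied from the question; column A stands in for the elided "..." columns.
times = ["06.01.2017 02:00:00", "06.01.2017 03:00:00", "06.01.2017 04:00:00"]
df = pd.DataFrame({
    "PERIOD_START_TIME": [t for t in times for _ in range(2)] * 2,
    "ID": [55] * 6 + [65] * 6,
    "A": [8] * 12,
    "VALUE": [35, 22, 63, 33, 63, 45, 10, 5, 22, 5, 12, 15],
})

keys = ["PERIOD_START_TIME", "ID"]

best = (
    df.sort_values("VALUE", ascending=False)       # largest VALUE first
      .drop_duplicates(subset=keys, keep="first")  # keep the top row of each (time, ID) pair
      .sort_values(keys)                           # restore a readable order
      .reset_index(drop=True)
)
print(best)
```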
jezrael answered Jan 18 '23 22:01