
Finding the max value in Python Column

I have a pandas DataFrame (combined_ranking_df) like this:

                Id  Rank                         Activity
0              14035   8.0                         deployed
1              47728   8.0                         deployed
2              24259   1.0                         NaN
3              24259   6.0                         WIP
4              14251   8.0                         deployed
5              14250   1.0                         NaN
6              14250   6.0                         WIP
7              14250   5.0                         NaN
8              14250   5.0                         NaN
9              14250   1.0                         NaN
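For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the blank Activity cells are real missing values (np.nan) and that Rank is stored as float.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
# Assumption: the "NaN" cells in Activity are genuine missing values.
combined_ranking_df = pd.DataFrame({
    "Id": [14035, 47728, 24259, 24259, 14251,
           14250, 14250, 14250, 14250, 14250],
    "Rank": [8.0, 8.0, 1.0, 6.0, 8.0, 1.0, 6.0, 5.0, 5.0, 1.0],
    "Activity": ["deployed", "deployed", np.nan, "WIP", "deployed",
                 np.nan, "WIP", np.nan, np.nan, np.nan],
})

print(combined_ranking_df)
```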

I am trying to keep only the row with the max Rank for each Id. For example, for 14250 it should be 6.0, and for 24259 it should also be 6.0. The expected result is:

                Id  Rank                         Activity
0              14035   8.0                         deployed
1              47728   8.0                         deployed
3              24259   6.0                         WIP
4              14251   8.0                         deployed
6              14250   6.0                         WIP

I tried combined_ranking_df.groupby(['Id'], sort=False)['Rank'].max(), but the result I got was the first dataframe (nothing changed).

What am I doing wrong?

Adam asked Jul 12 '17 17:07


4 Answers

Option 1
sort_values and drop_duplicates (same as @ayhan's answer)
Sorting the dataframe by 'Rank' leaves the maximal value in the last position within each 'Id' group, and pd.DataFrame.drop_duplicates lets us keep either the first or the last row of each group. This is very fast, but it is a handy coincidence rather than a general technique: it does not extend to, say, the top two rows per 'Id'.

df.sort_values('Rank').drop_duplicates('Id', keep='last')

      Id  Rank  Activity
3  24259   6.0       WIP
6  14250   6.0       WIP
0  14035   8.0  deployed
1  47728   8.0  deployed
4  14251   8.0  deployed

You can sort the index at the end to restore the original row order:

df.sort_values('Rank').drop_duplicates('Id', keep='last').sort_index()

      Id  Rank  Activity
0  14035   8.0  deployed
1  47728   8.0  deployed
3  24259   6.0       WIP
4  14251   8.0  deployed
6  14250   6.0       WIP
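To make the "first or last of each group" point concrete, here is a small sketch on a cut-down copy of the data. The frame and the names ordered, lowest and highest are made up for illustration; the only thing that changes between the two results is the keep argument.

```python
import pandas as pd

# Cut-down, illustrative copy of the question's data.
df = pd.DataFrame({
    "Id": [24259, 24259, 14250, 14250, 14250],
    "Rank": [1.0, 6.0, 1.0, 6.0, 5.0],
})

# After an ascending sort, each Id's smallest Rank comes first
# and its largest Rank comes last.
ordered = df.sort_values("Rank")

# keep="first" retains the minimum row per Id, keep="last" the maximum.
lowest = ordered.drop_duplicates("Id", keep="first").sort_index()
highest = ordered.drop_duplicates("Id", keep="last").sort_index()

print(lowest)
print(highest)
```

Passing keep by keyword also matters on recent pandas versions, where the arguments after subset can no longer be given positionally.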

Option 2
groupby and idxmax
This is what I'd consider the most idiomatic way to solve this problem. @MaxU's answer is the best way that generalizes to the largest n per 'Id'.

df.loc[df.groupby('Id', sort=False).Rank.idxmax()]

      Id  Rank  Activity
0  14035   8.0  deployed
1  47728   8.0  deployed
3  24259   6.0       WIP
4  14251   8.0  deployed
6  14250   6.0       WIP
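If it helps to see why this works, the one-liner can be split into two steps. The sketch below uses a shortened, made-up frame; best_labels and best_rows are just illustrative names.

```python
import pandas as pd

# Shortened, illustrative copy of the question's data.
df = pd.DataFrame({
    "Id": [14035, 24259, 24259, 14250, 14250],
    "Rank": [8.0, 1.0, 6.0, 6.0, 5.0],
    "Activity": ["deployed", None, "WIP", "WIP", None],
})

# Step 1: for each Id, idxmax returns the index label of the row
# that holds the largest Rank (a Series keyed by Id).
best_labels = df.groupby("Id", sort=False)["Rank"].idxmax()

# Step 2: .loc uses those labels to pull the complete rows,
# so the Activity column comes along for free.
best_rows = df.loc[best_labels]

print(best_labels)
print(best_rows)
```

Because idxmax returns a single label per group, this yields exactly one row per 'Id' even when the maximum is tied.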
piRSquared answered Oct 23 '22 23:10


IIUC:

In [40]: df.groupby('Id', as_index=False, sort=False) \
    ...:   .apply(lambda x: x.nlargest(1, ['Rank'])) \
    ...:   .reset_index(level=1, drop=True)
Out[40]:
      Id  Rank  Activity
0  14035   8.0  deployed
1  47728   8.0  deployed
2  24259   6.0       WIP
3  14251   8.0  deployed
4  14250   6.0       WIP

or a nicer version from @piRSquared:

In [41]: df.groupby('Id', group_keys=False, sort=False) \
    ...:   .apply(pd.DataFrame.nlargest, n=1, columns='Rank')
Out[41]:
      Id  Rank  Activity
0  14035   8.0  deployed
1  47728   8.0  deployed
3  24259   6.0       WIP
4  14251   8.0  deployed
6  14250   6.0       WIP
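For the "largest n per Id" case with n > 1, one possible sketch is below. Note that it swaps in a different technique from the snippets above: a descending sort followed by groupby(...).head(n), which avoids groupby.apply altogether. The frame and the names top_n and top_rows are made up for illustration.

```python
import pandas as pd

# Illustrative data: two Ids with several ranks each.
df = pd.DataFrame({
    "Id": [24259, 24259, 14250, 14250, 14250, 14250],
    "Rank": [1.0, 6.0, 1.0, 6.0, 5.0, 2.0],
})

top_n = 2  # hypothetical parameter: how many rows to keep per Id

top_rows = (
    df.sort_values("Rank", ascending=False)  # largest ranks first
      .groupby("Id", sort=False)
      .head(top_n)                           # first top_n rows of each Id
      .sort_index()                          # back to original row order
)

print(top_rows)
```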
MaxU - stop WAR against UA answered Oct 23 '22 23:10


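groupby does not modify the original dataframe, so store its result in a variable and work with that. Selecting 'Rank' before calling max keeps the aggregation to the one column you need:

groups = combined_ranking_df.groupby('Id', as_index=False, sort=False)['Rank'].max()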

      Id  Rank
0  14035   8.0
1  47728   8.0
2  24259   6.0
3  14251   8.0
4  14250   6.0
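The aggregated frame only has Id and Rank. If the Activity column is needed as well, one way (a sketch, not the only option) is to merge the maxima back onto the original frame. The data below is a shortened, made-up copy, and full_rows is an illustrative name.

```python
import pandas as pd

# Shortened, illustrative copy of the question's data.
df = pd.DataFrame({
    "Id": [14035, 24259, 24259, 14250, 14250],
    "Rank": [8.0, 1.0, 6.0, 6.0, 5.0],
    "Activity": ["deployed", None, "WIP", "WIP", None],
})

# One row per Id holding that Id's maximum Rank.
groups = df.groupby("Id", as_index=False, sort=False)["Rank"].max()

# An inner merge on both columns keeps only the original rows whose
# (Id, Rank) pair matches a per-Id maximum, restoring Activity.
full_rows = df.merge(groups, on=["Id", "Rank"])

print(groups)
print(full_rows)
```

If several rows of one Id share the maximum Rank, the merge returns all of them.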
Diego Aguado answered Oct 23 '22 22:10


You can create a boolean index to check if the Rank for a given Id equals its max value. Then use boolean indexing to extract the max values from the dataframe.

The mask is created using a groupby on Id with the help of transform, which preserves the original dimensions of the dataframe.

>>> df[df['Rank'] == df.groupby('Id')['Rank'].transform('max')]
      Id  Rank  Activity
0  14035     8  deployed
1  47728     8  deployed
3  24259     6       WIP
4  14251     8  deployed
6  14250     6       WIP
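One thing to be aware of with the mask approach: it keeps every row that equals the group maximum, so ties produce more than one row per Id. A small sketch with made-up data that contains a tie (the names is_group_max, all_max_rows and one_per_id are illustrative):

```python
import pandas as pd

# Made-up data in which Id 1 has two rows tied for the maximum Rank.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Rank": [6.0, 6.0, 2.0, 8.0, 3.0],
})

# transform("max") broadcasts each group's maximum back to every row,
# so the comparison lines up with the original index.
is_group_max = df["Rank"] == df.groupby("Id")["Rank"].transform("max")

all_max_rows = df[is_group_max]                 # both tied rows survive
one_per_id = all_max_rows.drop_duplicates("Id") # keep the first per Id

print(all_max_rows)
print(one_per_id)
```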
Alexander answered Oct 23 '22 22:10