I have a data frame (combined_ranking_df) like this in pandas:
Id Rank Activity
0 14035 8.0 deployed
1 47728 8.0 deployed
2 24259 1.0 NaN
3 24259 6.0 WIP
4 14251 8.0 deployed
5 14250 1.0 NaN
6 14250 6.0 WIP
7 14250 5.0 NaN
8 14250 5.0 NaN
9 14250 1.0 NaN
I am trying to get the row with the max Rank for each Id. For example, for 14250 it should be 6.0, and for 24259 it should also be 6.0:
Id Rank Activity
0 14035 8.0 deployed
1 47728 8.0 deployed
3 24259 6.0 WIP
4 14251 8.0 deployed
6 14250 6.0 WIP
I tried combined_ranking_df.groupby(['Id'], sort=False)['Rank'].max(), but the result I got was the first dataframe (nothing changed).
What am I doing wrong?
Option 1
Same as @ayhan's answer here
This answers the question by sorting the dataframe so that the maximal value ends up in the last position within each 'Id' group. pd.DataFrame.drop_duplicates lets us keep the first or last row of each group. This is a handy coincidence that happens to be very fast, but it does not generalize to, say, the top two per 'Id'.
df.sort_values('Rank').drop_duplicates('Id', keep='last')
Id Rank Activity
3 24259 6.0 WIP
6 14250 6.0 WIP
0 14035 8.0 deployed
1 47728 8.0 deployed
4 14251 8.0 deployed
You can sort the index at the end:
df.sort_values('Rank').drop_duplicates('Id', keep='last').sort_index()
Id Rank Activity
0 14035 8.0 deployed
1 47728 8.0 deployed
3 24259 6.0 WIP
4 14251 8.0 deployed
6 14250 6.0 WIP
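As a self-contained sketch of Option 1, the snippet below rebuilds the question's data (the name df and the literal values are copied from the table in the question) and applies the sort-then-deduplicate idea. It passes keep='last' by keyword, on the assumption that a recent pandas version is in use where keep is keyword-only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "Id": [14035, 47728, 24259, 24259, 14251, 14250, 14250, 14250, 14250, 14250],
    "Rank": [8.0, 8.0, 1.0, 6.0, 8.0, 1.0, 6.0, 5.0, 5.0, 1.0],
    "Activity": ["deployed", "deployed", np.nan, "WIP", "deployed",
                 np.nan, "WIP", np.nan, np.nan, np.nan],
})

# Sort ascending by Rank so each Id's largest Rank is its last occurrence,
# keep only that last occurrence, then restore the original row order.
result = (
    df.sort_values("Rank")
      .drop_duplicates("Id", keep="last")
      .sort_index()
)
print(result)
```

If two rows of the same Id tie for the maximum, only one of them survives here, and which one depends on the sort order.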
Option 2
groupby and idxmax
This is what I'd consider the most idiomatic way to solve this problem. @MaxU's answer is the best way that generalizes to the largest n per 'Id'.
df.loc[df.groupby('Id', sort=False).Rank.idxmax()]
Id Rank Activity
0 14035 8.0 deployed
1 47728 8.0 deployed
3 24259 6.0 WIP
4 14251 8.0 deployed
6 14250 6.0 WIP
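To make the difference from the original attempt concrete, here is a minimal sketch (data copied from the question; the variable names per_id_max, row_labels and result are only illustrative). It shows that .max() on the grouped column yields one value per Id rather than rows of the frame, while .idxmax() yields the row labels that can be passed to df.loc.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "Id": [14035, 47728, 24259, 24259, 14251, 14250, 14250, 14250, 14250, 14250],
    "Rank": [8.0, 8.0, 1.0, 6.0, 8.0, 1.0, 6.0, 5.0, 5.0, 1.0],
    "Activity": ["deployed", "deployed", np.nan, "WIP", "deployed",
                 np.nan, "WIP", np.nan, np.nan, np.nan],
})

grouped = df.groupby("Id", sort=False)["Rank"]

# A Series indexed by Id holding each group's maximum Rank.
# It is a new object; df itself is not modified.
per_id_max = grouped.max()

# A Series indexed by Id holding the row label where that maximum occurs.
row_labels = grouped.idxmax()

# Select the full rows (including Activity) for those labels.
result = df.loc[row_labels]
print(result)
```

Because groupby returns a new object, the original frame looks unchanged unless the result is assigned to a variable, which may explain what the question observed.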
IIUC:
In [40]: df.groupby('Id', as_index=False, sort=False) \
    ...:   .apply(lambda x: x.nlargest(1, ['Rank'])) \
    ...:   .reset_index(level=1, drop=True)
Out[40]:
Id Rank Activity
0 14035 8.0 deployed
1 47728 8.0 deployed
2 24259 6.0 WIP
3 14251 8.0 deployed
4 14250 6.0 WIP
or a nicer version from @piRSquared:
In [41]: df.groupby('Id', group_keys=False, sort=False) \
    ...:   .apply(pd.DataFrame.nlargest, n=1, columns='Rank')
Out[41]:
Id Rank Activity
0 14035 8.0 deployed
1 47728 8.0 deployed
3 24259 6.0 WIP
4 14251 8.0 deployed
6 14250 6.0 WIP
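For the "largest n per Id" generalization, one alternative sketch is shown below. Note that it swaps in a different technique from the apply/nlargest calls above: it sorts by Rank descending and takes groupby(...).head(n), which avoids apply altogether. The helper name top_n_per_id is hypothetical, and the data is copied from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "Id": [14035, 47728, 24259, 24259, 14251, 14250, 14250, 14250, 14250, 14250],
    "Rank": [8.0, 8.0, 1.0, 6.0, 8.0, 1.0, 6.0, 5.0, 5.0, 1.0],
    "Activity": ["deployed", "deployed", np.nan, "WIP", "deployed",
                 np.nan, "WIP", np.nan, np.nan, np.nan],
})


def top_n_per_id(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n rows with the largest Rank for each Id (hypothetical helper)."""
    return (
        # Stable sort so ties keep their original relative order.
        frame.sort_values("Rank", ascending=False, kind="stable")
             .groupby("Id", sort=False)
             .head(n)          # first n rows of each group in sorted order
             .sort_index()     # restore the original row order
    )


top1 = top_n_per_id(df, 1)
top2 = top_n_per_id(df, 2)
print(top2)
```

With n=1 this selects the same rows as the answers above; with n=2, Ids that have only one row simply contribute that single row.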
Try storing the grouped result and then working with that stored object:
groups = combined_ranking_df.groupby(['Id'], as_index=False, sort=False).max()[['Id','Rank']]
Id Rank
0 14035 8.0
1 47728 8.0
2 24259 6.0
3 14251 8.0
4 14250 6.0
You can create a boolean index to check if the Rank for a given Id equals its max value. Then use boolean indexing to extract the max values from the dataframe.
The mask is created using a groupby on Id with the help of transform, which preserves the original dimensions of the dataframe.
>>> df[(df[['Rank']] == df[['Id', 'Rank']].groupby('Id').transform('max')).squeeze().tolist()]
Id Rank Activity
0 14035 8 deployed
1 47728 8 deployed
3 24259 6 WIP
4 14251 8 deployed
6 14250 6 WIP
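The same mask can be written a little more directly by working with the Rank column as a Series. This is only a sketch under the assumption that the frame looks like the question's data (copied below); the names group_max and mask are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "Id": [14035, 47728, 24259, 24259, 14251, 14250, 14250, 14250, 14250, 14250],
    "Rank": [8.0, 8.0, 1.0, 6.0, 8.0, 1.0, 6.0, 5.0, 5.0, 1.0],
    "Activity": ["deployed", "deployed", np.nan, "WIP", "deployed",
                 np.nan, "WIP", np.nan, np.nan, np.nan],
})

# transform broadcasts each group's maximum back onto every row of that group,
# so group_max has the same length and index as df.
group_max = df.groupby("Id")["Rank"].transform("max")

# True where a row's Rank equals the maximum of its Id group.
mask = df["Rank"] == group_max

result = df[mask]
print(result)
```

Unlike the idxmax and drop_duplicates approaches, this keeps every row that ties for the maximum within an Id.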