Based off this question.
import pandas

df = pandas.DataFrame([[2001, "Jack", 77], [2005, "Jack", 44], [2001, "Jill", 93]],
                      columns=['Year', 'Name', 'Value'])

   Year  Name  Value
0  2001  Jack     77
1  2005  Jack     44
2  2001  Jill     93
For each unique Name, I would like to keep the row with the largest Year value. In the above example I would like to get the table
   Year  Name  Value
0  2005  Jack     44
1  2001  Jill     93
I tried solving this with groupby + apply:
df.groupby('Name', as_index=False)\
.apply(lambda x: x.sort_values('Value').head(1))
     Year  Name  Value
0 1  2005  Jack     44
1 2  2001  Jill     93
Not the best approach, but I'm more interested in what is happening, and why. The result has a MultiIndex that looks like this:
MultiIndex(levels=[[0, 1], [1, 2]],
           labels=[[0, 1], [0, 1]])
I'm not looking for a workaround. I'm actually more interested to know why this happens, and how I can prevent it without changing my approach.
IIUC, use group_keys=False:
df.groupby('Name', group_keys=False).apply(lambda x: x.sort_values('Value').head(1))
Output:
   Year  Name  Value
1  2005  Jack     44
2  2001  Jill     93
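As for the why, as far as I can tell: apply concatenates the per-group results, and with the default group_keys=True it prepends one index level saying which group each piece came from. With as_index=False that level holds the group's position (0, 1, ...) instead of its name, which would explain the outer level in the question. Here is a minimal sketch of the three cases. Selecting the ['Year', 'Value'] columns before apply is my own addition, made because pandas 2.2 deprecated operating on the grouping column inside apply, so the Name column is missing from these results. first_by_value is just an illustrative helper name.

```python
import pandas as pd

df = pd.DataFrame(
    [[2001, "Jack", 77], [2005, "Jack", 44], [2001, "Jill", 93]],
    columns=["Year", "Name", "Value"],
)


def first_by_value(group):
    # Same logic as the lambda in the question: row with the smallest Value.
    return group.sort_values("Value").head(1)


# Assumption: restrict apply to the non-key columns so the snippet does not
# depend on how a given pandas version treats the grouping column.
cols = ["Year", "Value"]

# Default group_keys=True: the group name becomes an extra outer index level.
with_keys = df.groupby("Name")[cols].apply(first_by_value)
print(with_keys.index.tolist())   # [('Jack', 1), ('Jill', 2)]

# as_index=False: the outer level is the group's position, not its name.
positional = df.groupby("Name", as_index=False)[cols].apply(first_by_value)
print(positional.index.tolist())  # [(0, 1), (1, 2)]

# group_keys=False: no outer level is added, the original row labels survive.
flat = df.groupby("Name", group_keys=False)[cols].apply(first_by_value)
print(flat.index.tolist())        # [1, 2]

# If the extra level is already there, it can be dropped after the fact.
dropped = positional.droplevel(0)
print(dropped.index.tolist())     # [1, 2]
```

So the inner level (1, 2) is the original row label of whatever each group returned, and group_keys only controls whether the outer level is added on top of it.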