I would like to keep the latest entry per group in a dataframe:
from datetime import date
import pandas as pd
data = [
['A', date(2018,2,1), "I want this"],
['A', date(2018,1,1), "Don't want"],
['B', date(2019,4,1), "Don't want"],
['B', date(2019,5,1), "I want this"]]
df = pd.DataFrame(data, columns=['name', 'date', 'result'])
The following does what I want (found here, with credit to the original author):
df.sort_values('date').groupby('name').tail(1)
  name        date       result
0    A  2018-02-01  I want this
3    B  2019-05-01  I want this
But how do I know that the order is always preserved when I do a groupby on a sorted dataframe like df? Is this documented somewhere?
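Not the order of the groups: by default groupby sorts the group keys (sort=True), so they come out alphabetically rather than in the order of the sorted frame. Replace A with Z and run an aggregation such as last() to see it. The order of rows within each group is preserved; the groupby documentation says so under the sort parameter.
To keep the groups in order of first appearance as well, use sort=False: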
df.sort_values('date').groupby('name', sort=False).tail(1)
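A small sketch to check how sort interacts with tail, assuming the question's data with A renamed to Z (the variable names below are only for illustration). As far as I can tell from the pandas documentation, tail returns a subset of the original rows in their original order, so sort should mainly matter for aggregations such as last:

```python
from datetime import date
import pandas as pd

# Same data as in the question, with 'A' renamed to 'Z' so that the
# alphabetical key order differs from the order of first appearance.
data = [
    ['Z', date(2018, 2, 1), "I want this"],
    ['Z', date(2018, 1, 1), "Don't want"],
    ['B', date(2019, 4, 1), "Don't want"],
    ['B', date(2019, 5, 1), "I want this"],
]
df = pd.DataFrame(data, columns=['name', 'date', 'result'])
by_date = df.sort_values('date')

# tail() filters rows, so the result keeps the row order of by_date.
tail_default = by_date.groupby('name').tail(1)
tail_unsorted = by_date.groupby('name', sort=False).tail(1)

# last() aggregates to one row per key, so here `sort` decides the key order.
last_default = by_date.groupby('name').last()
last_unsorted = by_date.groupby('name', sort=False).last()

print(tail_default.index.tolist())   # [0, 3]
print(tail_unsorted.index.tolist())  # [0, 3]
print(last_default.index.tolist())   # ['B', 'Z']
print(last_unsorted.index.tolist())  # ['Z', 'B']
```

In this sketch both tail calls return the same rows in the same order, and only the last() results differ in key order, so sort=False is mostly a safeguard (and a small speed-up) when the next step is an aggregation rather than tail.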