I want to drop duplicates and keep the row with the latest timestamp. Rows count as duplicates when they share the same customer_id and var_name.
Here's my data:
customer_id value var_name timestamp
1 1 apple 2018-03-22 00:00:00.000
2 3 apple 2018-03-23 08:00:00.000
2 4 apple 2018-03-24 08:00:00.000
1 1 orange 2018-03-22 08:00:00.000
2 3 orange 2018-03-24 08:00:00.000
2 5 orange 2018-03-23 08:00:00.000
So the result should be:
customer_id value var_name timestamp
1 1 apple 2018-03-22 00:00:00.000
2 4 apple 2018-03-24 08:00:00.000
1 1 orange 2018-03-22 08:00:00.000
2 3 orange 2018-03-24 08:00:00.000
I think you need sort_values with drop_duplicates:
df = df.sort_values('timestamp').drop_duplicates(['customer_id','var_name'], keep='last')
print(df)
customer_id value var_name timestamp
0 1 1 apple 2018-03-22 00:00:00.000
3 1 1 orange 2018-03-22 08:00:00.000
2 2 4 apple 2018-03-24 08:00:00.000
4 2 3 orange 2018-03-24 08:00:00.000
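For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand, and the names `latest` and `latest_in_original_order` are only illustrative. It assumes the timestamps have been parsed with pd.to_datetime.

```python
import pandas as pd

# Sample data copied from the question; timestamps parsed to datetime64.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": pd.to_datetime([
        "2018-03-22 00:00:00", "2018-03-23 08:00:00", "2018-03-24 08:00:00",
        "2018-03-22 08:00:00", "2018-03-24 08:00:00", "2018-03-23 08:00:00",
    ]),
})

# Sort ascending by timestamp, then keep the last (latest) row
# for each (customer_id, var_name) pair.
latest = (
    df.sort_values("timestamp")
      .drop_duplicates(["customer_id", "var_name"], keep="last")
)

# Optional: restore the original row order by sorting on the index.
latest_in_original_order = latest.sort_index()
print(latest_in_original_order)
```

The final sort_index() is optional; it only matters if you want the surviving rows back in the order they had in the input.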
If you don't want to sort because the original row order is important (this assumes timestamp is a datetime column):
df = df.loc[df.groupby(['customer_id','var_name'], sort=False)['timestamp'].idxmax()]
print(df)
customer_id value var_name timestamp
0 1 1 apple 2018-03-22 00:00:00
2 2 4 apple 2018-03-24 08:00:00
3 1 1 orange 2018-03-22 08:00:00
4 2 3 orange 2018-03-24 08:00:00
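A runnable sketch of the groupby approach, again using the sample data from the question. Here the timestamps start out as strings, which is an assumption about how the data was loaded, so they are converted first. The names `idx` and `latest` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question, with timestamps still as strings.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": [
        "2018-03-22 00:00:00.000", "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000", "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000", "2018-03-23 08:00:00.000",
    ],
})

# Convert to datetime so the maximum is the chronologically latest value.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# For each (customer_id, var_name) group, find the index label of the row
# with the latest timestamp. sort=False keeps groups in first-seen order.
idx = df.groupby(["customer_id", "var_name"], sort=False)["timestamp"].idxmax()

# Select those rows by label.
latest = df.loc[idx]
print(latest)
```

Because idxmax returns index labels, this relies on the DataFrame index being unique; if it is not, calling reset_index(drop=True) first is one way around that.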