I have a dataframe with IDs and timestamps as a multi-index. The index in the dataframe is sorted by IDs and timestamps and I want to pick the lastest timestamp for each IDs. for example:
IDs    timestamp     value
0      2010-10-30     1
       2010-11-30     2
1      2000-01-01     300
       2007-01-01     33
       2010-01-01     400
2      2000-01-01     11
So basically the result I want is
IDs    timestamp    value
0      2010-11-30   2
1      2010-01-01   400
2      2000-01-01   11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.BytesIO("""\
IDs    timestamp     value
0      2010-10-30     1
0      2010-11-30     2
1      2000-01-01     300
1      2007-01-01     33
1      2010-01-01     400
2      2000-01-01     11""")
df = pd.read_table(content, header=0, sep='\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
using reset_index followed by groupby 
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
              timestamp  value
IDs                           
0   2010-11-30 00:00:00      2
1   2010-01-01 00:00:00    400
2   2000-01-01 00:00:00     11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
                        One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values. 
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With