Best way to get last entries from Pandas data frame

I recently had to get the last set status for certain items, labeled with ids. I found this answer: Python: How can I get rows which have the max value of the group to which they belong?

To my surprise, it was fairly slow even on a dataset with only ~2e6 rows. However, I do not need all the max values, only the last one per id.

import time

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.random.randint(1, 1000, size=5000),
    "status": np.random.randint(1, 10, size=5000),
    "date": [
        time.strftime("%Y-%m-%d", time.localtime(time.time() - x))
        for x in np.random.randint(-5e7, 5e7, size=5000)
    ],
})

%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
1 loops, best of 3: 576 ms per loop

%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
100 loops, best of 3: 4.82 ms per loop

The first one is the solution from the linked answer; it is more general and allows more complex per-group operations.

For my problem, however, sorting, dropping duplicates, and reindexing performs much better, and the difference grows with larger data sets.

My question: are there other ways to achieve this, possibly with even better performance?
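For this exact task (latest row per id), `drop_duplicates` with `subset="id"` and `keep="last"` gets the same result without the reindex step. A minimal sketch on a hypothetical toy frame (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame: ids repeat with different dates and statuses.
df = pd.DataFrame({
    "id":     [1, 1, 2, 2, 3],
    "status": [5, 7, 2, 9, 4],
    "date":   pd.to_datetime(
        ["2022-01-01", "2022-03-01", "2022-02-01", "2022-01-15", "2022-05-01"]
    ),
})

# Sort ascending by date, then keep only the last (most recent) row per id.
last = df.sort_values("date").drop_duplicates(subset="id", keep="last")
print(last)
```

This sorts once and keeps the final occurrence of each id, which is exactly the "last set status" semantics described above.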

galinden asked Oct 31 '22 16:10

1 Answer

Another way to approach this is to use an aggregation on the groupby, followed by a selection on the full dataframe.

df.iloc[df.groupby('id')['date'].idxmax()]

This appears to be about a factor of 5-10 faster than the solutions you proposed (see below). Note that this will only work if the 'date' column is numerical rather than string type, and that this transformation also speeds up your sorting-based solution:

# Timing your original solutions:
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 826 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 100 loops, best of 3: 5.1 ms per loop

# convert the date
df['date'] = pd.to_datetime(df['date'])

# new times on your solutions
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 815 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 1000 loops, best of 3: 1.99 ms per loop

# my aggregation solution
%timeit df.iloc[df.groupby('id')['date'].idxmax()]
# 10 loops, best of 3: 135 ms per loop
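One caveat worth noting: `idxmax` returns index *labels*, not positions, so `.loc` is the safer indexer than `.iloc` whenever the index is not a default RangeIndex. A small sketch with made-up data:

```python
import pandas as pd

# Toy frame with a default RangeIndex; with any other index,
# .iloc on idxmax results would select the wrong rows.
df = pd.DataFrame({
    "id":     [1, 1, 2, 2],
    "date":   pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01", "2022-01-10"]
    ),
    "status": [3, 8, 6, 1],
})

# idxmax yields the index label of the max date within each group;
# .loc selects those rows by label.
last = df.loc[df.groupby("id")["date"].idxmax()]
print(last)
```

On a default RangeIndex the two indexers coincide, which is why the `iloc` version above works in the benchmark; using `loc` makes the intent explicit.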
jakevdp answered Nov 09 '22 10:11