I recently had to get the last set status for certain items, labeled with ids. I found this answer: Python : How can I get Rows which have the max value of the group to which they belong?
To my surprise, on a dataset with only ~2e6 rows it was fairly slow. However, I do not need all of the max values per group, only the last one.
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.random.randint(1, 1000, size=5000),
    "status": np.random.randint(1, 10, size=5000),
    "date": [
        time.strftime("%Y-%m-%d", time.localtime(time.time() - x))
        # cast the bounds to int: newer NumPy rejects float arguments here
        for x in np.random.randint(int(-5e7), int(5e7), size=5000)
    ],
})
%timeit df.groupby('id').apply(lambda t: t[t.date == t.date.max()])
# 1 loops, best of 3: 576 ms per loop

%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 100 loops, best of 3: 4.82 ms per loop
The first is the solution I found in the linked answer; it is more general and allows more complex operations.
For my issue, however, I can sort, drop duplicates, and reindex, which performs much better. Especially on larger data sets this makes a real difference.
My question: are there other ways to achieve what I want to do, possibly with even better performance?
Another way to approach this is to use an aggregation on the groupby, followed by a selection on the full dataframe:
df.iloc[df.groupby('id')['date'].idxmax()]
This appears to be about a factor of 5-10 faster than the solutions you proposed (timings below). Note that this only works if the 'date' column is of a numerical type (e.g. datetime) rather than string type, and that this conversion also speeds up your sorting-based solution:
# Timing your original solutions:
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 826 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 100 loops, best of 3: 5.1 ms per loop
# convert the date
df['date'] = pd.to_datetime(df['date'])
# new times on your solutions
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 815 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 1000 loops, best of 3: 1.99 ms per loop
# my aggregation solution
%timeit df.iloc[df.groupby('id')['date'].idxmax()]
# 10 loops, best of 3: 135 ms per loop
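For completeness, two other common idioms for "last row per group" are worth timing on your own data: sort ascending and keep the last duplicate per id, or sort once and take each group's tail. A minimal sketch, using synthetic data in place of your frame (the column names match your example; the sizes and seed are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the frame above; 'date' is already datetime here.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": rng.integers(1, 1000, size=5000),
    "status": rng.integers(1, 10, size=5000),
    "date": pd.to_datetime("2020-01-01")
            + pd.to_timedelta(rng.integers(0, 10**6, size=5000), unit="s"),
})

# Variant 1: sort ascending by date, then keep only the last
# (i.e. newest) occurrence of each id.
last = df.sort_values("date").drop_duplicates("id", keep="last")

# Variant 2: sort once, then take the final row of every group.
last2 = df.sort_values("date").groupby("id").tail(1)
```

Both return one row per id whose date equals that group's maximum; which variant is fastest depends on your pandas version and data distribution, so benchmark them alongside the idxmax approach.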