I'm selecting one max row per group and I'm using groupby
/agg
to return index values and select the rows using loc
.
For example, to group by "Id"
and then select the row with the highest "delta"
value:
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]
However, it's so slow this way. Actually, my i7/16G RAM laptop hangs when I'm using this query on 13 million rows.
I have two questions for experts:
[Update]
Thank you so much for @unutbu 's analysis!
sort_drop
it is! On my i7/32GRAM machine, groupby+idxmax hangs for nearly 14 hours (never return a thing) however sort_drop
handled it LESS THAN A MINUTE!
I still need to look at how pandas implements each method but problems solved for now! I love StackOverflow.
The fastest option depends not only on length of the DataFrame (in this case, around 13M rows) but also on the number of groups. Below are perfplots which compare a number of ways of finding the maximum in each group:
If there an only a few (large) groups, using_idxmax
may be the fastest option:
If there are many (small) groups and the DataFrame is not too large, using_sort_drop
may be the fastest option:
Keep in mind, however, that while using_sort_drop
, using_sort
and using_rank
start out looking very fast, as N = len(df)
increases, their speed relative to the other options disappears quickly. For large enough N
, using_idxmax
becomes the fastest option, even if there are many groups.
using_sort_drop
, using_sort
and using_rank
sorts the DataFrame (or groups within the DataFrame). Sorting is O(N * log(N))
on average, while the other methods use O(N)
operations. This is why methods like using_idxmax
beats using_sort_drop
for very large DataFrames.
Be aware that benchmark results may vary for a number of reasons, including machine specs, OS, and software versions. So it is important to run benchmarks on your own machine, and with test data tailored to your situation.
Based on the perfplots above, using_sort_drop
may be an option worth considering for your DataFrame of 13M rows, especially if it has many (small) groups. Otherwise, I would suspect using_idxmax
to be the fastest option -- but again, it's important that you check benchmarks on your machine.
Here is the setup I used to make the perfplots:
import numpy as np
import pandas as pd
import perfplot
def make_df(N):
# lots of small groups
df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
# few large groups
# df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
return df
def using_idxmax(df):
return df.loc[df.groupby("Id")['delta'].idxmax()]
def max_mask(s):
i = np.asarray(s).argmax()
result = [False]*len(s)
result[i] = True
return result
def using_custom_mask(df):
mask = df.groupby("Id")['delta'].transform(max_mask)
return df.loc[mask]
def using_isin(df):
idx = df.groupby("Id")['delta'].idxmax()
mask = df.index.isin(idx)
return df.loc[mask]
def using_sort(df):
df = df.sort_values(by=['delta'], ascending=False, kind='mergesort')
return df.groupby('Id', as_index=False).first()
def using_rank(df):
mask = (df.groupby('Id')['delta'].rank(method='first', ascending=False) == 1)
return df.loc[mask]
def using_sort_drop(df):
# Thanks to jezrael
# https://stackoverflow.com/questions/50381064/select-the-max-row-per-group-pandas-performance-issue/50389889?noredirect=1#comment87795818_50389889
return df.sort_values(by=['delta'], ascending=False, kind='mergesort').drop_duplicates('Id')
def using_apply(df):
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
return df.loc[selected_idx]
def check(df1, df2):
df1 = df1.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
df2 = df2.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
return df1.equals(df2)
perfplot.show(
setup=make_df,
kernels=[using_idxmax, using_custom_mask, using_isin, using_sort,
using_rank, using_apply, using_sort_drop],
n_range=[2**k for k in range(2, 20)],
logx=True,
logy=True,
xlabel='len(df)',
repeat=75,
equality_check=check)
Another way to benchmark is to use IPython %timeit:
In [55]: df = make_df(2**20)
In [56]: %timeit using_sort_drop(df)
1 loop, best of 3: 403 ms per loop
In [57]: %timeit using_rank(df)
1 loop, best of 3: 1.04 s per loop
In [58]: %timeit using_idxmax(df)
1 loop, best of 3: 15.8 s per loop
from numba import njit
import numpy as np
@njit
def nidxmax(bins, k, weights):
out = np.zeros(k, np.int64)
trk = np.zeros(k)
for i, w in enumerate(weights - (weights.min() - 1)):
b = bins[i]
if w > trk[b]:
trk[b] = w
out[b] = i
return np.sort(out)
def with_numba_idxmax(df):
f, u = pd.factorize(df.Id)
return df.iloc[nidxmax(f, len(u), df.delta.values)]
Borrowing from @unutbu
def make_df(N):
# lots of small groups
df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
# few large groups
# df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
return df
jit
with_numba_idxmax(make_df(10));
df = make_df(2**20)
%timeit with_numba_idxmax(df)
%timeit using_sort_drop(df)
47.4 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
194 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With