Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to speedup the groupby.filter() operation?

I have the following code with groupby function

 order_diff = order.groupby("CD_QTY").filter(
     lambda x: x["BOX_ID"].nunique() > 1
)

But the filter function is very slow

I would like to use the transform or map function. But the result with both transform and map function is different than the filter function

I have tried transform() and map()

# tranform
oder_diff = order[order.groupby("CD_QTY")["BOX_ID"].transform('nunique').gt(1)]

#using map()
order[order["CD_QTY"].map(order.groupby("CD_QTY")["BOX_ID"].nunique()).gt(1)]

I am a bit confused about why the results are different than the filter(). I would appreciate your feedback on this. Thank you.

like image 548
prb07 Avatar asked Oct 15 '25 17:10

prb07


2 Answers

My rule of thumb is that any pandas function that take a custom lambda will be slowed down significantly due to looping and hence perfomance-limited by the GIL.

If you care about speed, use vectorized functions as much as possible:

# Some mock data
n = 10_000

np.random.seed(42)
cd_qty = np.random.randint(1, n, n)
box_id = np.random.randint(1, 5, n)

order = pd.DataFrame({
    'CD_QTY': cd_qty,
    'BOX_ID': box_id
})

# The original solution. 1.59s
order_diff_1 = order.groupby("CD_QTY").filter(
     lambda x: x["BOX_ID"].nunique() > 1
)

# Vectorized. 8.5 ms -> 187x faster
count = order.groupby(['CD_QTY'])['BOX_ID'].nunique()
cond = order['CD_QTY'].isin(count[count > 1].index)
order_diff_2 = order.loc[cond]

# Check if they produce the same outputs
all(order_diff_1 == order_diff_2) # --> True
like image 85
Code Different Avatar answered Oct 18 '25 06:10

Code Different


Testing and your both solutions are faster like accepted answer, here is winner GroupBy.transform:

# Some mock data
n = 10_000

np.random.seed(42)
cd_qty = np.random.randint(1, n, n)
box_id = np.random.randint(1, 5, n)

order = pd.DataFrame({
    'CD_QTY': cd_qty,
    'BOX_ID': box_id
})

count = order.groupby(['CD_QTY'])['BOX_ID'].nunique()
cond = order['CD_QTY'].isin(count[count > 1].index)
order_diff_2 = order.loc[cond]

order_diff = order[order.groupby("CD_QTY")["BOX_ID"].transform('nunique').gt(1)]
order_diff_3 =order[order["CD_QTY"].map(order.groupby("CD_QTY")["BOX_ID"].nunique()).gt(1)]

print (all(order_diff == order_diff_2))
True
print (all(order_diff == order_diff_3))
True

In [201]: %%timeit
     ...: count = order.groupby(['CD_QTY'])['BOX_ID'].nunique()
     ...: cond = order['CD_QTY'].isin(count[count > 1].index)
     ...: order_diff_2 = order.loc[cond]
     ...: 
4.8 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [202]: %%timeit
     ...: order[order.groupby("CD_QTY")["BOX_ID"].transform('nunique').gt(1)]
     ...: 
4.02 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [203]: %%timeit
     ...: order[order["CD_QTY"].map(order.groupby("CD_QTY")["BOX_ID"].nunique()).gt(1)]
     ...: 
4.31 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 33
jezrael Avatar answered Oct 18 '25 07:10

jezrael



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!