
How to speed up pandas groupby - apply function to be comparable to R's data.table

I have data like this

   location  sales  store
0        68    583     17
1        28    857      2
2        55    190     59
3        98    517     64
4        94    892     79
...

For each unique pair (location, store), there are 1 or more sales. I want to add a column, pcnt_sales that shows what percent of the total sales for that (location, store) pair was made up by the sale in the given row.

   location  sales  store  pcnt_sales
0        68    583     17    0.254363
1        28    857      2    0.346543
2        55    190     59    1.000000
3        98    517     64    0.272105
4        94    892     79    1.000000
...

This works, but it is slow:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'location': np.random.randint(0, 100, 10000),
    'store': np.random.randint(0, 100, 10000),
    'sales': np.random.randint(0, 1000, 10000)
})

import timeit
start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
print(timeit.default_timer() - start_time)  # 1.46 seconds

By comparison, R's data.table does this very quickly:

library(data.table)

dt <- data.table(location=sample(100, size=10000, replace=TRUE),
                 store=sample(100, size=10000, replace=TRUE),
                 sales=sample(1000, size=10000, replace=TRUE))

ptm <- proc.time()
dt[, pcnt_sales:=sales/sum(sales), by=c("location", "store")]
proc.time() - ptm  # 0.007 seconds

How do I do this efficiently in Pandas (especially considering my real dataset has millions of rows)?

asked May 09 '16 by Ben

1 Answer

For performance you want to avoid apply. Use transform instead: it broadcasts the groupwise result back onto the original index, so the division then runs at vectorized speed:

>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop
>>> %timeit df['pcnt_sales2'] = (df["sales"] /
            df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop
>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True
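A self-contained version of the transform approach might look like the sketch below. It uses the string `'sum'` rather than the builtin `sum`, which lets pandas dispatch to its optimized aggregation path; the data here is illustrative (sales drawn from 1–999 to avoid all-zero groups), not the asker's real dataset:

```python
import numpy as np
import pandas as pd

# Illustrative data, mirroring the question's setup
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'location': rng.integers(0, 100, 10_000),
    'store': rng.integers(0, 100, 10_000),
    'sales': rng.integers(1, 1000, 10_000),
})

# Groupwise total broadcast back to each row, then one vectorized division
group_total = df.groupby(['location', 'store'])['sales'].transform('sum')
df['pcnt_sales'] = df['sales'] / group_total

# Sanity check: within each (location, store) pair the percentages sum to 1
assert np.allclose(df.groupby(['location', 'store'])['pcnt_sales'].sum(), 1.0)
```

The key design point is that `transform` returns a Series aligned to `df`'s index (one value per input row), whereas a plain groupby aggregation returns one value per group; the alignment is what makes the single vectorized division possible.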
answered Oct 06 '22 by DSM