I have data like this:

   location  sales  store
0        68    583     17
1        28    857      2
2        55    190     59
3        98    517     64
4        94    892     79
...
For each unique pair (location, store) there are one or more sales. I want to add a column, pcnt_sales, that shows what share of the total sales for that (location, store) pair the sale in the given row makes up (as a fraction, so 0.254363 rather than 25.4363):
   location  sales  store  pcnt_sales
0        68    583     17    0.254363
1        28    857      2    0.346543
2        55    190     59    1.000000
3        98    517     64    0.272105
4        94    892     79    1.000000
...
This works, but is slow:

import pandas as pd
import numpy as np
import timeit

df = pd.DataFrame({'location': np.random.randint(0, 100, 10000),
                   'store': np.random.randint(0, 100, 10000),
                   'sales': np.random.randint(0, 1000, 10000)})

start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x / x.sum())
print(timeit.default_timer() - start_time)  # 1.46 seconds
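Most of that time goes to apply invoking the Python lambda once per (location, store) group and stitching the results back together. A quick sketch to see how many such calls that entails (ngroups is a standard GroupBy attribute; the exact count depends on the random draw):

# count the distinct (location, store) pairs; with 10,000 random rows over a
# 100x100 grid this is typically several thousand, and apply makes one
# Python-level call for each of them
print(df.groupby(['location', 'store']).ngroups)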
By comparison, R's data.table does this super fast:
library(data.table)

dt <- data.table(location = sample(100, size = 10000, replace = TRUE),
                 store = sample(100, size = 10000, replace = TRUE),
                 sales = sample(1000, size = 10000, replace = TRUE))

ptm <- proc.time()
dt[, pcnt_sales := sales / sum(sales), by = c("location", "store")]
proc.time() - ptm  # 0.007 seconds
How do I do this efficiently in pandas (especially considering my real dataset has millions of rows)?
For performance you want to avoid apply. You can use transform instead to get the result of the groupby expanded back to the original index, at which point the division works at vectorized speed:
>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop

>>> %timeit df['pcnt_sales2'] = (df["sales"] /
...                              df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop

>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True
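The same approach holds on newer pandas versions, though passing the builtin sum to transform may emit a deprecation warning there; the string alias 'sum' is the preferred spelling. A minimal self-contained sketch (sales drawn from 1..999 here, an assumption made so no group can have a zero total in the sanity check):

import numpy as np
import pandas as pd

# random data shaped like the question's
df = pd.DataFrame({'location': np.random.randint(0, 100, 10000),
                   'store': np.random.randint(0, 100, 10000),
                   'sales': np.random.randint(1, 1000, 10000)})

# transform('sum') broadcasts each group's total back onto the original
# index, so the division is a single vectorized operation over the column
totals = df.groupby(['location', 'store'])['sales'].transform('sum')
df['pcnt_sales'] = df['sales'] / totals

# sanity check: the fractions within each (location, store) group sum to 1
assert np.allclose(df.groupby(['location', 'store'])['pcnt_sales'].sum(), 1.0)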