Given a Pandas Data Frame that looks like this
GROUP VALUE MASK
1 5 false
2 10 false
2 20 false
1 7 true
3 17 false
3 18 false
1 100 false
1 200 true
For every row where the MASK is true I would like to get the difference to the previous value within this group, i.e., the result should be
VALUE DIFF
7 2
200 100
How can I compute this in Pandas?
How can I compute this fast for a large data frame with about 2 million rows and 1 million groups?
Actually the bottleneck is groupby
. You don't actually need to use groupby
for this specific problem. To sort
the dataframe by GROUP
, perform diff
on the sorted dataframe and filter
by MASK
should be okay. We must use kind='mergesort'
to keep the order unchanged within the group before and after sorting,
Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this
pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
Performance Tests:
MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})
%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop
#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop
use groupby
, diff
, 'MASK'
pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With