Given a Pandas DataFrame that looks like this:
GROUP VALUE MASK
1 5 false
2 10 false
2 20 false
1 7 true
3 17 false
3 18 false
1 100 false
1 200 true
For every row where MASK is true, I would like to get the difference to the previous value within its group, i.e. the result should be
VALUE DIFF
7 2
200 100
How can I compute this in Pandas?
How can I compute this fast for a large DataFrame with about 2 million rows and 1 million groups?
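For reference, a minimal snippet that reproduces the example frame above (a sketch; column names and values taken from the table):
import pandas as pd
df = pd.DataFrame({"GROUP": [1, 2, 2, 1, 3, 3, 1, 1],
                   "VALUE": [5, 10, 20, 7, 17, 18, 100, 200],
                   "MASK": [False, False, False, True, False, False, False, True]})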
Actually, the bottleneck here is the groupby. You don't need groupby for this specific problem: sort the DataFrame by GROUP, take the diff on the sorted frame, and then filter by MASK. The sort must use kind='mergesort' so that it is stable and the order within each group is unchanged.
Assuming MASK is always False for the first element of each group (the first element has no meaningful difference anyway), you can use this:
pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
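The same idea spelled out step by step (a sketch, assuming df is the frame from the question and MASK is False on each group's first row):
sorted_values = df.sort_values(by="GROUP", kind='mergesort').VALUE  # stable sort keeps the within-group order
diffs = sorted_values.diff()  # difference to the previous row of the sorted frame; index labels are preserved
result = pd.concat([df.VALUE, diffs], axis=1, keys=['VALUE', 'DIFF'])  # concat realigns on the original index
result = result[df.MASK]  # keep only the masked rows; this also drops the cross-group boundary diffs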
Performance Tests:
import numpy as np
import pandas as pd

MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN) * GROUPS),
                   "VALUE": np.ceil(np.random.rand(MAXN) * 10000),
                   "MASK": np.floor(np.random.rand(MAXN) * 2).astype("bool")})
%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop
# With MAXN = 2000000 and GROUPS = 1000000 (df regenerated as above):
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop
Use groupby, diff, and boolean indexing with MASK:
pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
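On the example frame from the question, this should return the two masked rows, something like (with the default 0-based index):
   VALUE   DIFF
3      7    2.0
7    200  100.0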
