Pandas: Difference to previous value

Question

Given a Pandas Data Frame that looks like this

GROUP   VALUE    MASK
  1        5     false
  2       10     false
  2       20     false
  1        7     true
  3        17    false
  3        18    false
  1        100   false
  1        200   true

For every row where the MASK is true I would like to get the difference to the previous value within this group, i.e., the result should be

VALUE DIFF
  7      2
  200  100

How can I compute this in Pandas?

How can I compute this fast for a large data frame with about 2 million rows and 1 million groups?

TurtleIzzy · Accepted Answer

Actually the bottleneck is groupby. You don't actually need to use groupby for this specific problem. To sort the dataframe by GROUP, perform diff on the sorted dataframe and filter by MASK should be okay. We must use kind='mergesort' to keep the order unchanged within the group before and after sorting,

Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this

pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

Performance Tests:

MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})

%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop

%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop

#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop

piRSquared · Answer

use groupby, diff, 'MASK'

pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
          axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

enter image description here

Pandas: Difference to previous value

Tags:

python

pandas

David

2 Answers

TurtleIzzy

piRSquared

Recent Activity

Donate For Us

Pandas: Difference to previous value

Tags:

python

pandas

David

2 Answers

TurtleIzzy

piRSquared

Related questions

Recent Activity

Donate For Us