Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Difference to previous value

Tags:

python

pandas

Given a Pandas Data Frame that looks like this

GROUP   VALUE    MASK
  1        5     false
  2       10     false
  2       20     false
  1        7     true
  3        17    false
  3        18    false
  1        100   false
  1        200   true

For every row where the MASK is true I would like to get the difference to the previous value within this group, i.e., the result should be

VALUE DIFF
  7      2
  200  100

How can I compute this in Pandas?

How can I compute this fast for a large data frame with about 2 million rows and 1 million groups?

like image 510
David Avatar asked Oct 21 '16 21:10

David


2 Answers

Actually the bottleneck is groupby. You don't actually need to use groupby for this specific problem. To sort the dataframe by GROUP, perform diff on the sorted dataframe and filter by MASK should be okay. We must use kind='mergesort' to keep the order unchanged within the group before and after sorting,

Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this

pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

Performance Tests:

MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})

%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop

%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop

#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop
like image 98
TurtleIzzy Avatar answered Sep 30 '22 13:09

TurtleIzzy


use groupby, diff, 'MASK'

pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
          axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

enter image description here

like image 33
piRSquared Avatar answered Sep 30 '22 14:09

piRSquared