Subtract aggregate from Pandas Series/Dataframe [duplicate]

Given the following table

   vals
0    20
1     3
2     2
3    10
4    20

I'm trying to find a clean way in pandas to subtract a value, say 30, from the column by consuming rows from the top: rows that are fully used up become 0, the row where the running total first reaches 30 keeps the remainder, and later rows are left untouched. The result would be:

   vals
0     0
1     0
2     0
3     5
4    20

I was wondering if pandas has a way to do this without looping over every row of the DataFrame, something that takes advantage of pandas' vectorized operations.
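
For anyone following along, the example frame can be rebuilt like this (a minimal setup sketch; the construction itself is not part of the original question):

import pandas as pd

# rebuild the example column from the question
df = pd.DataFrame({'vals': [20, 3, 2, 10, 20]})
print(df)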

asked May 18 '17 at 18:05 by jab


1 Answer

  • identify where the cumulative sum is greater than or equal to 30
  • zero out the rows where it isn't
  • set the row where the total first reaches 30 to the cumulative sum minus 30

c = df.vals.cumsum()         # running total: 20, 23, 25, 35, 55
m = c.ge(30)                 # True from the row where the total reaches 30
i = m.idxmax()               # index of the first True (row 3 here)
n = df.vals.where(m, 0)      # zero out everything before that row
n.loc[i] = c.loc[i] - 30     # keep only the remainder in the crossing row
df.assign(vals=n)            # new frame with the adjusted column

   vals
0     0
1     0
2     0
3     5
4    20
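
If the same logic is needed more than once, it can be wrapped in a small helper. subtract_total below is a hypothetical name that simply repackages the steps above; it is a sketch, not part of the original answer:

def subtract_total(df, amount, col='vals'):
    # Consume `amount` from the top of `col`, zeroing fully consumed rows
    # and leaving the remainder in the row where the running total crosses it.
    c = df[col].cumsum()
    m = c.ge(amount)
    n = df[col].where(m, 0)
    if m.any():                # if the total never reaches `amount`, everything is consumed
        i = m.idxmax()
        n.loc[i] = c.loc[i] - amount
    return df.assign(**{col: n})

subtract_total(df, 30)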

Same thing, but numpyfied

import numpy as np

v = df.vals.values           # underlying numpy array
c = v.cumsum()               # running total
m = c >= 30                  # True from the crossing point onward
i = m.argmax()               # position of the first True
n = np.where(m, v, 0)        # zero out everything before the crossing point
n[i] = c[i] - 30             # keep only the remainder in the crossing row
df.assign(vals=n)

   vals
0     0
1     0
2     0
3     5
4    20
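
The same result can also be written as a clip on the cumulative sum followed by a diff, which undoes the cumsum. This is an equivalent formulation (assuming the values are non-negative), sketched here rather than taken from the original answer:

import numpy as np

# shift the running total down by 30, floor it at 0, then undo the cumsum
c = (df.vals.cumsum() - 30).clip(lower=0)
df.assign(vals=np.diff(c, prepend=0))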

Timing

%%timeit 
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
10000 loops, best of 3: 168 µs per loop

%%timeit
c = df.vals.cumsum()
m = c.ge(30)
i = m.idxmax()
n = df.vals.where(m, 0)
n.loc[i] = c.loc[i] - 30
df.assign(vals=n)
1000 loops, best of 3: 853 µs per loop
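
To reproduce the comparison outside a notebook, the standard timeit module works as well. The frame below is an arbitrary assumption, used only to run the benchmark on more rows:

import timeit
import numpy as np
import pandas as pd

# an arbitrary larger frame purely for benchmarking
df = pd.DataFrame({'vals': np.random.randint(1, 100, size=10_000)})

def numpy_version():
    v = df.vals.values
    c = v.cumsum()
    m = c >= 30
    i = m.argmax()
    n = np.where(m, v, 0)
    n[i] = c[i] - 30
    return df.assign(vals=n)

print(timeit.timeit(numpy_version, number=1000))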
answered Sep 29 '22 at 19:09 by piRSquared