
Cumulative sum but conditionally excluding earlier rows

I have a DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'val_a': [3, 3, 3, 2, 2, 2, 1, 1, 1],
  'val_b': [3, np.nan, 2, 2, 2, 0, 1, np.nan, 0],
  'quantity': [1, 4, 2, 8, 5, 7, 1, 4, 2]
})

It looks like this:

|    |   val_a |   val_b |   quantity |
|---:|--------:|--------:|-----------:|
|  0 |       3 |       3 |          1 |
|  1 |       3 |     nan |          4 |
|  2 |       3 |       2 |          2 |
|  3 |       2 |       2 |          8 |
|  4 |       2 |       2 |          5 |
|  5 |       2 |       0 |          7 |
|  6 |       1 |       1 |          1 |
|  7 |       1 |     nan |          4 |
|  8 |       1 |       0 |          2 |

It is ordered by val_a. I'd like to take a cumulative sum of the total quantity for each val_a. So:

df.groupby('val_a', sort=False).sum().cumsum().drop(columns='val_b')

which gives

|   val_a |   quantity |
|--------:|-----------:|
|       3 |          7 |
|       2 |         27 |
|       1 |         34 |

However, here's the tricky part.

I'd like to exclude any earlier row whose val_b is greater than the current group's val_a key. I'll clarify with an example:

  • when calculating the total for when val_a is 3, none of the rows have val_b greater than val_a. So the cumulative total for when val_a is 3 is 7;
  • when calculating the total for when val_a is 2, row 0 has val_b greater than 2. That row has quantity 1. So, excluding it, the cumulative total for when val_a is 2 is 27 - 1, i.e. 26;
  • when calculating the total for when val_a is 1, rows 0, 2, 3, and 4 have val_b greater than 1. Those rows have quantities 1, 2, 8, and 5. So, excluding them, the cumulative total for when val_a is 1 is 34 - 1 - 2 - 8 - 5, i.e. 18.

Here's the desired output:

|   val_a |   quantity |
|--------:|-----------:|
|       3 |          7 |
|       2 |         26 |
|       1 |         18 |
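This arithmetic can be double-checked with a short brute-force loop (a hypothetical helper, not the vectorised solution being asked for): for each key, take all rows up to and including that key's group and drop the ones whose val_b exceeds the key. Note this relies on the frame being sorted by val_a descending, as in the example, and on NaN comparisons evaluating to False, so rows with missing val_b are kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'val_a': [3, 3, 3, 2, 2, 2, 1, 1, 1],
    'val_b': [3, np.nan, 2, 2, 2, 0, 1, np.nan, 0],
    'quantity': [1, 4, 2, 8, 5, 7, 1, 4, 2],
})

totals = {}
for key in df['val_a'].unique():        # order of appearance: 3, 2, 1
    upto = df[df['val_a'] >= key]       # rows up to and including this group
    keep = ~(upto['val_b'] > key)       # NaN > key is False, so NaN rows survive
    totals[int(key)] = int(upto.loc[keep, 'quantity'].sum())

print(totals)  # {3: 7, 2: 26, 1: 18}
```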
asked Jul 14 '21 by ignoring_gravity

3 Answers

With the help of NumPy:

# sum without conditions
raw_sum = df.groupby("val_a", sort=False).quantity.sum().cumsum()

# comparing each `val_b` against each unique `val_a` via `gt.outer`
sub_mask = np.greater.outer(df.val_b.to_numpy(), df.val_a.unique())

# selecting values to subtract from `quantity` and summing per `val_a`
to_sub = (sub_mask * df.quantity.to_numpy()[:, np.newaxis]).sum(axis=0)

# subtracting from the raw sum
result = raw_sum - to_sub

to get

>>> result.reset_index()

   val_a  quantity
0      3         7
1      2        26
2      1        18
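One detail worth noting: `np.greater.outer` builds a rows × keys boolean matrix, and any comparison involving NaN evaluates to False, so rows with a missing val_b are never subtracted. A minimal sketch with made-up values:

```python
import numpy as np

val_b = np.array([3.0, np.nan, 2.0])   # toy column with one missing value
keys = np.array([3, 2, 1])             # unique val_a keys in order of appearance

mask = np.greater.outer(val_b, keys)   # mask[i, j] is val_b[i] > keys[j]
print(mask)                            # the NaN row is all False
```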
answered Oct 22 '22 by Mustafa Aydın

You could try this:

# Cumulative sum per val_a, keeping the original (descending) key order
group_df = (
    df.groupby("val_a", sort=False)
    .sum()
    .cumsum()
    .drop(columns="val_b")
    .reset_index()
)

# For each key, subtract the total quantity of rows whose val_b exceeds it
for idx, key in group_df["val_a"].items():
    group_df.loc[idx, "quantity"] -= df.loc[df["val_b"] > key, "quantity"].sum()

# Your desired output
print(group_df)
   val_a  quantity
0      3         7
1      2        26
2      1        18
answered Oct 22 '22 by Laurent


Here is a function to calculate the correction factor:

df2 = df.set_index('val_a')  # keeps the original descending val_a order

def cond_sub(s):
    # rows up to and including the current key (label slice on the
    # monotonic decreasing index), then sum quantity where val_b > key
    upto = df2.loc[:s.name]
    return upto.loc[upto['val_b'] > s.name, 'quantity'].sum()

which, applied to each key, gives the amount to subtract:

1    16
2     1
3     0

You can then combine it with your original function:

(df.groupby('val_a', sort=False).sum().cumsum().drop(columns='val_b')['quantity']
 - df.groupby('val_a', sort=False).sum().apply(cond_sub, axis=1))

output:

3     7
2    26
1    18
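The `df2.loc[:s.name]` slice works because df2 keeps the original descending val_a order: label-based slicing on a monotonic non-unique index includes every row up to the last occurrence of the stop label. A small illustration with made-up numbers:

```python
import pandas as pd

# toy Series with a monotonic decreasing, non-unique index
s = pd.Series([1, 4, 2, 8], index=[3, 3, 2, 2])

# slices up to and including the *last* occurrence of label 2
print(s.loc[:2].tolist())  # [1, 4, 2, 8]
```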
answered Oct 22 '22 by mozway