I have a DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'val_a': [3, 3, 3, 2, 2, 2, 1, 1, 1],
    'val_b': [3, np.nan, 2, 2, 2, 0, 1, np.nan, 0],
    'quantity': [1, 4, 2, 8, 5, 7, 1, 4, 2]
})
It looks like this:
| | val_a | val_b | quantity |
|---:|--------:|--------:|-----------:|
| 0 | 3 | 3 | 1 |
| 1 | 3 | nan | 4 |
| 2 | 3 | 2 | 2 |
| 3 | 2 | 2 | 8 |
| 4 | 2 | 2 | 5 |
| 5 | 2 | 0 | 7 |
| 6 | 1 | 1 | 1 |
| 7 | 1 | nan | 4 |
| 8 | 1 | 0 | 2 |
It is ordered by `val_a`. I'd like to take a cumulative sum of the total `quantity` for each `val_a`. So:
df.groupby('val_a', sort=False).sum().cumsum().drop(columns='val_b')
which gives
| val_a | quantity |
|--------:|-----------:|
| 3 | 7 |
| 2 | 27 |
| 1 | 34 |
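Note that `sort=False` keeps the group keys in their order of first appearance, so the cumulative sum accumulates down the frame rather than by ascending key:

>>> df.groupby('val_a', sort=False).sum().index.tolist()
[3, 2, 1]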
However, here's the tricky part. I'd like to exclude rows where the value of `val_b` is greater than the key `val_a`. I'll clarify with an example:

- When `val_a` is 3, none of the rows have `val_b` greater than 3, so the cumulative total for `val_a` 3 is 7.
- When `val_a` is 2, row 0 has `val_b` greater than 2. That row has `quantity` 1, so, excluding it, the cumulative total for `val_a` 2 is 27 - 1, i.e. 26.
- When `val_a` is 1, rows 0, 2, 3, and 4 have `val_b` greater than 1. Those rows have `quantity` values 1, 2, 8, and 5, so, excluding them, the cumulative total for `val_a` 1 is 34 - 1 - 2 - 8 - 5, i.e. 18.

Here's the desired output:
| val_a | quantity |
|--------:|-----------:|
| 3 | 7 |
| 2 | 26 |
| 1 | 18 |
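In other words, a brute-force version of what I want (assuming, as in the example data, that the frame stays sorted by `val_a` in descending order) would be:

# Brute-force check of the rule: for each key, total `quantity` over all rows
# seen so far (val_a >= key works because the frame is sorted descending)
# whose val_b does not exceed the key. Comparisons against NaN are False,
# so rows with a missing val_b are never excluded.
for key in df['val_a'].unique():          # keys in original order: 3, 2, 1
    seen = df[df['val_a'] >= key]         # this group and every group above it
    kept = seen[~(seen['val_b'] > key)]   # drop rows with val_b > key
    print(key, kept['quantity'].sum())
# 3 7
# 2 26
# 1 18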
With the help of NumPy:
# sum without conditions
raw_sum = df.groupby("val_a", sort=False).quantity.sum().cumsum()
# comparing each `val_b` against each unique `val_a` via `gt.outer`
sub_mask = np.greater.outer(df.val_b.to_numpy(), df.val_a.unique())
# selecting values to subtract from `quantity` and summing per `val_a`
to_sub = (sub_mask * df.quantity.to_numpy()[:, np.newaxis]).sum(axis=0)
# subtracting from the raw sum
result = raw_sum - to_sub
to get
>>> result.reset_index()
val_a quantity
0 3 7
1 2 26
2 1 18
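As a sanity check, the intermediate `to_sub` holds the per-key totals to exclude, aligned with `df.val_a.unique()` (values worked out from the example data). Any comparison against NaN is False, so rows with missing `val_b` never contribute:

>>> df.val_a.unique()
array([3, 2, 1])
>>> to_sub
array([ 0,  1, 16])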
You could try this:
# Your initial dataframe, grouped and cumulatively summed
# (sort=False keeps the original key order: 3, 2, 1)
group_df = (
    df.groupby("val_a", sort=False).sum().cumsum().drop(columns="val_b").reset_index()
)
# Iterating according to the logic of your example: for each key,
# subtract the quantities of every row whose val_b exceeds that key
for j, key in enumerate(group_df["val_a"]):
    group_df.loc[j, "quantity"] -= df.loc[df["val_b"] > key, "quantity"].sum()
# Your desired output
print(group_df)
   val_a  quantity
0      3         7
1      2        26
2      1        18
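If you prefer to apply the correction in one shot rather than mutating row by row, the same subtraction can be written as a list comprehension (a minor variation on the loop above, applied to the freshly grouped `group_df` before any correction):

# one correction term per key, subtracted elementwise from the cumulative sums
group_df["quantity"] -= [
    df.loc[df["val_b"] > key, "quantity"].sum() for key in group_df["val_a"]
]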
Here is a function to calculate the correction factor:
df2 = df.set_index('val_a')

def cond_sub(s):
    # slice every row down to and including the current key
    # (label slicing works here because the val_a index is sorted descending)
    sub = df2.loc[:s.name]
    # total the quantities of the rows whose val_b exceeds the key
    return sub.loc[sub['val_b'] > s.name, 'quantity'].sum()

output:

>>> df.groupby('val_a').sum().apply(cond_sub, axis=1)
val_a
1    16
2     1
3     0
dtype: int64
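Note that `df2.loc[:s.name]` is a label slice; pandas accepts it here because the `val_a` index is monotonically decreasing, so the slice selects everything from the top of the frame down through the last occurrence of the key:

>>> df2.loc[:2].index.tolist()
[3, 3, 3, 2, 2, 2]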
You can then combine it with your original function:
(df.groupby('val_a', sort=False).sum().cumsum().drop(columns='val_b')['quantity']
 - df.groupby('val_a', sort=False).sum().apply(cond_sub, axis=1))
output:

val_a
3     7
2    26
1    18
dtype: int64