I have a dataframe like this:
df = pd.DataFrame({'keys': list('aaaabbbbccccc'), 'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})
keys values
0 a 1
1 a 5
2 a 6
3 a 8
4 b 2
5 b 4
6 b 7
7 b 7
8 c 1
9 c 1
10 c 1
11 c 1
12 c 5
Further, I have a variable max_sum = 10.
I want to assign a group to each row based on (i) the value in keys and (ii) max_sum, which must not be exceeded per group.
My expected outcome looks like this:
keys values group
0 a 1 1
1 a 5 1
2 a 6 2
3 a 8 3
4 b 2 4
5 b 4 4
6 b 7 5
7 b 7 6
8 c 1 7
9 c 1 7
10 c 1 7
11 c 1 7
12 c 5 7
So, the first two values in the a group (1 and 5) sum up to 6, which is less than 10, so they are in the same group. If we now also added 6, max_sum would be exceeded, and therefore this value goes into group 2. We cannot add 8 to this group as then max_sum would again be exceeded, therefore we define a group 3. The same logic then applies for keys b and c.
One can do
df['cumsum'] = df.groupby('keys')['values'].cumsum()
keys values cumsum
0 a 1 1
1 a 5 6
2 a 6 12
3 a 8 20
4 b 2 2
5 b 4 6
6 b 7 13
7 b 7 20
8 c 1 1
9 c 1 2
10 c 1 3
11 c 1 4
12 c 5 9
but I don't know how to get the group info from this.
We want to partition rows based on their cumulative sum, so we take cumsum, compute the modulus with respect to max_sum, and then take the difference: points where the difference is negative mark the start of the next group. We also need to do this per key, so the entire operation described above is done inside a GroupBy.apply call.
(df.groupby('keys')['values']
.apply(lambda x: x.cumsum().mod(max_sum).diff())
.fillna(-1)
.lt(0)
.cumsum())
0 1
1 1
2 2
3 3
4 4
5 4
6 5
7 6
8 7
9 7
10 7
11 7
12 7
Name: values, dtype: int64
In a comment below, I wrote:
@Cleb Looks like my answer here is wrong. For 4, 4, 9, 2, the output should be 1, 1, 2, 3 but my code will assign 1, 1, 2, 2 because cumsum discounts the values.
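The failure can be reproduced in isolation; a minimal sketch of the cumsum/mod/diff recipe applied to that single-key sequence (assuming max_sum = 10):

```python
import pandas as pd

max_sum = 10
s = pd.Series([4, 4, 9, 2])

# the cumsum/mod/diff recipe from above, applied to a single key
groups = s.cumsum().mod(max_sum).diff().fillna(-1).lt(0).cumsum()
print(groups.tolist())  # [1, 1, 2, 2] -- the correct grouping is [1, 1, 2, 3]
```

The modulus discards how much the running sum actually overflowed, which is why the last element is not split into its own group.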
So, here's my solution to address this corner case. Define a function that assigns groups:
grp = {'grp': 0}  # better than `global`, at least
def func(V):
    cumsum = 0
    grp['grp'] += 1
    grps = []
    for v in V.tolist():
        cumsum += v
        if cumsum > max_sum:
            cumsum = v
            grp['grp'] += 1
        grps.append(grp['grp'])
    return pd.Series(grps)
Now, call apply
:
df.groupby('keys')['values'].apply(func).values
# array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7])
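As a sanity check, func handles the 4, 4, 9, 2 corner case from the comment correctly (redefined here so the snippet runs standalone):

```python
import pandas as pd

max_sum = 10
grp = {'grp': 0}

def func(V):
    cumsum = 0
    grp['grp'] += 1
    grps = []
    for v in V.tolist():
        cumsum += v
        if cumsum > max_sum:
            cumsum = v   # reset the running sum to the overflowing value
            grp['grp'] += 1
        grps.append(grp['grp'])
    return pd.Series(grps)

test = pd.DataFrame({'keys': list('aaaa'), 'values': [4, 4, 9, 2]})
res = test.groupby('keys')['values'].apply(func).values
print(res)  # [1 1 2 3]
```

Resetting cumsum to v (instead of taking a modulus) is exactly what fixes the corner case.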
We can create two masks, and based on those build a True / False array:
m1: rows where the cumulative sum per key is greater than max_sum are marked as True, else False.
m2: rows where the keys value is not the same as in the previous row.
With np.where we basically have the following in pseudocode:
when m1 or m2 is True, return True, else False
Now we can translate True and False to 1 / 0, since they are booleans:
True + True
2
That's the reason for the cumsum in the last line.
Code:
max_sum = 10
m1 = df.groupby('keys')['values'].cumsum().gt(max_sum) # all values which are greater than max_sum
m2 = df['keys'].ne(df['keys'].shift()) # all rows where keys change
df['group'] = np.where(m1 | m2, True, False).cumsum()
keys values group
0 a 1 1
1 a 5 1
2 a 6 2
3 a 8 3
4 b 2 4
5 b 4 4
6 b 7 5
7 b 7 6
8 c 1 7
9 c 1 7
10 c 1 7
11 c 1 7
12 c 5 7
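A possible caveat, sketched under the corner case raised in the comments above: m1 compares the raw cumulative sum against max_sum, so once the cumsum exceeds max_sum it stays above it (for positive values), and every later row in that key opens a new group:

```python
import numpy as np
import pandas as pd

max_sum = 10
d = pd.DataFrame({'keys': list('aaaaa'), 'values': [4, 4, 9, 2, 2]})

# same two masks as above, applied to the corner-case data
m1 = d.groupby('keys')['values'].cumsum().gt(max_sum)
m2 = d['keys'].ne(d['keys'].shift())
out = np.where(m1 | m2, True, False).cumsum()
print(out)  # [1 1 2 3 4] -- the expected grouping is [1, 1, 2, 3, 3]
```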
My logic: first get the cumsum within each key, then take the previous key's last (i.e. maximum) group number and add it as an offset to the next key's groups.
s=(df.groupby('keys')['values'].cumsum()//10+1)
s+s.groupby(df['keys']).last().shift().fillna(0).cumsum().reindex(df['keys']).values
Out[24]:
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 4.0
6 5.0
7 6.0
8 7.0
9 7.0
10 7.0
11 7.0
12 7.0
Name: values, dtype: float64
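To unpack the one-liner, a step-by-step sketch using the question's df and max_sum = 10:

```python
import pandas as pd

max_sum = 10
df = pd.DataFrame({'keys': list('aaaabbbbccccc'),
                   'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})

# within-key bucket number: which max_sum bucket the running sum falls into
s = df.groupby('keys')['values'].cumsum() // max_sum + 1

# offset per key: total number of groups consumed by all earlier keys
offset = s.groupby(df['keys']).last().shift().fillna(0).cumsum()
print(offset.to_dict())  # {'a': 0.0, 'b': 3.0, 'c': 6.0}

result = (s + offset.reindex(df['keys']).values).astype(int)
print(result.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7]
```

The final astype(int) only removes the float dtype introduced by fillna/shift; the grouping itself is unchanged.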
Another way
pd.factorize(list(zip(df['keys'],df.groupby('keys')['values'].cumsum()//10)))[0]+1
Out[51]: array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7], dtype=int64)
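pd.factorize assigns integer codes to distinct values in order of first appearance, so each unique (key, bucket) pair gets the next group number; a minimal illustration:

```python
import pandas as pd

# factorize labels tuples by first appearance: ('a', 0) -> 0, ('a', 1) -> 1, ...
codes, uniques = pd.factorize([('a', 0), ('a', 0), ('a', 1), ('b', 0)])
print(codes + 1)  # [1 1 2 3]
```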
Method 3, using the expanded data from Pir:
s=df.groupby('keys')['values'].rolling(2,min_periods=1).sum().gt(10)
s.loc[s.groupby(level=0).head(1).index[1:]]=True
s.cumsum()+1
Out[79]:
keys
a 0 1
1 1
2 2
3 3
b 4 4
5 4
6 5
7 6
c 8 7
9 7
10 7
11 7
12 7
d 13 8
14 8
15 9
16 10
Name: values, dtype: int32
At least, not as far as I can tell. Consider the expanded example:
df = pd.DataFrame({
'keys': [*'aaaabbbbcccccdddddddd'],
'values': [*map(int, '156824771111544922252')]
})
def gen_groups(tups, max_sum=10):
    label = 0
    sums = {}
    for key, val in tups:
        if key not in sums:
            label += 1
            sums[key] = 0
        sums[key] += val
        if sums[key] > max_sum:
            # This resets the summation
            # to the first thing that exceeded the max
            sums[key] = val
            label += 1
        yield label
df.assign(group=[*gen_groups(zip(df['keys'], df['values']))])
OUTPUT
keys values group
0 a 1 1
1 a 5 1
2 a 6 2
3 a 8 3
4 b 2 4
5 b 4 4
6 b 7 5
7 b 7 6
8 c 1 7
9 c 1 7
10 c 1 7
11 c 1 7
12 c 5 7
13 d 4 8 # First group for `key == d`
14 d 4 8 # Still same group because `4 + 4 <= 10`
15 d 9 9 # New group because `4 + 4 + 9 > 10`
16 d 2 10 # New group because `9 + 2 > 10`
17 d 2 10 # Same group because `2 + 2 <= 10`
18 d 2 10 # Same group because `2 + 2 + 2 <= 10`
19 d 5 11 # New group because `2 + 2 + 2 + 5 > 10`
20 d 2 11 # Same group because `5 + 2 <= 10`
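Since gen_groups resets the running sum to the overflowing value, it handles the 4, 4, 9, 2 corner case from the comments directly; a quick standalone check (function redefined so the snippet runs on its own):

```python
def gen_groups(tups, max_sum=10):
    # same generator as above: one running sum per key,
    # reset to the overflowing value when max_sum is exceeded
    label = 0
    sums = {}
    for key, val in tups:
        if key not in sums:
            label += 1
            sums[key] = 0
        sums[key] += val
        if sums[key] > max_sum:
            sums[key] = val
            label += 1
        yield label

print(list(gen_groups([('a', 4), ('a', 4), ('a', 9), ('a', 2)])))  # [1, 1, 2, 3]
```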