I have a dataframe like this:
df = pd.DataFrame({'keys': list('aaaabbbbccccc'), 'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})
keys values
0 a 1
1 a 5
2 a 6
3 a 8
4 b 2
5 b 4
6 b 7
7 b 7
8 c 1
9 c 1
10 c 1
11 c 1
12 c 5
Further, I have a variable max_sum = 10.
I want to assign a group to each row (i) based on the value in keys and (ii) the max_sum which should not be exceeded per group.
My expected outcome looks like this:
keys values group
0 a 1 1
1 a 5 1
2 a 6 2
3 a 8 3
4 b 2 4
5 b 4 4
6 b 7 5
7 b 7 6
8 c 1 7
9 c 1 7
10 c 1 7
11 c 1 7
12 c 5 7
So, the first two values in the a group (1 and 5) sum to 6, which is less than 10, so they are in the same group. If we also added 6, max_sum would be exceeded, so that value goes into group 2. We cannot add 8 to that group either, as max_sum would again be exceeded, therefore we define a group 3. The same logic then applies to keys b and c.
One can do
df['cumsum'] = df.groupby('keys')['values'].cumsum()
keys values cumsum
0 a 1 1
1 a 5 6
2 a 6 12
3 a 8 20
4 b 2 2
5 b 4 6
6 b 7 13
7 b 7 20
8 c 1 1
9 c 1 2
10 c 1 3
11 c 1 4
12 c 5 9
but I don't know how to get the group info from this.
We want to partition rows based on their cumulative sum, so we take the cumsum, compute it modulo max_sum, and then diff to find the points where the difference is negative (those mark the start of the next group). We also need to do this per key, so the entire operation described above is done inside a GroupBy.apply call.
(df.groupby('keys')['values']
.apply(lambda x: x.cumsum().mod(max_sum).diff())
.fillna(-1)
.lt(0)
.cumsum())
0 1
1 1
2 2
3 3
4 4
5 4
6 5
7 6
8 7
9 7
10 7
11 7
12 7
Name: values, dtype: int64
In a comment below, I wrote:
@Cleb Looks like my answer here is wrong. For 4, 4, 9, 2, the output should be 1, 1, 2, 3 but my code will assign 1, 1, 2, 2 because cumsum discounts the values.
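The failure mentioned in that comment is easy to reproduce. Running the mod/diff pipeline on a single key with values 4, 4, 9, 2 (a minimal standalone check, not from the original answer):

```python
import pandas as pd

max_sum = 10
x = pd.Series([4, 4, 9, 2])
# cumsum: 4, 8, 17, 19 -> mod 10: 4, 8, 7, 9 -> diff goes negative only once
groups = x.cumsum().mod(max_sum).diff().fillna(-1).lt(0).cumsum()
print(groups.tolist())  # [1, 1, 2, 2] -- the last row should open group 3
```

The running cumsum carries the already-consumed values forward, so 9 + 2 = 11 > 10 is never detected after the reset point.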
So, here's my solution to address this corner case. Define a function that assigns groups:
grp = {'grp': 0}  # better than `global`, at least
def func(V):
    cumsum = 0
    grp['grp'] += 1
    grps = []
    for v in V.tolist():
        cumsum += v
        if cumsum > max_sum:
            cumsum = v
            grp['grp'] += 1
        grps.append(grp['grp'])
    return pd.Series(grps)
Now, call apply:
df.groupby('keys')['values'].apply(func).values
# array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7])
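A quick check of the corrected function on the comment's corner case, calling func directly on a Series (which sidesteps any GroupBy.apply subtleties; the input values are from the comment above):

```python
import pandas as pd

max_sum = 10
grp = {'grp': 0}  # shared counter so group numbers keep increasing across keys

def func(V):
    cumsum = 0
    grp['grp'] += 1
    grps = []
    for v in V.tolist():
        cumsum += v
        if cumsum > max_sum:
            cumsum = v  # reset the running sum to the overflowing value
            grp['grp'] += 1
        grps.append(grp['grp'])
    return pd.Series(grps)

result = func(pd.Series([4, 4, 9, 2])).tolist()
print(result)  # [1, 1, 2, 3]
```

Because the running sum is reset to the overflowing value itself, 9 + 2 = 11 > 10 is now correctly detected and 2 opens a third group.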
We can create two masks and, based on those, a True / False array:
m1 marks all rows where the cumulative sum within a key exceeds max_sum.
m2 marks all rows where the key is not the same as in the previous row.
With np.where we basically have the following in pseudocode: when m1 or m2 is True, return True, else False.
Since booleans translate to 1 / 0 (True + True gives 2), the cumsum in the last line turns these group-start markers into running group numbers.
Code:
import numpy as np

max_sum = 10
m1 = df.groupby('keys')['values'].cumsum().gt(max_sum)  # all values which are greater than max_sum
m2 = df['keys'].ne(df['keys'].shift())                  # all rows where keys change
df['group'] = np.where(m1 | m2, True, False).cumsum()
keys values group
0 a 1 1
1 a 5 1
2 a 6 2
3 a 8 3
4 b 2 4
5 b 4 4
6 b 7 5
7 b 7 6
8 c 1 7
9 c 1 7
10 c 1 7
11 c 1 7
12 c 5 7
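One caveat worth noting (my own observation, using a made-up key 'x'): m1 compares the running cumsum against max_sum without ever resetting it, so every row after the first overflow within a key starts a new group, even when consecutive values would still fit together:

```python
import pandas as pd

max_sum = 10
demo = pd.DataFrame({'keys': list('xxx'), 'values': [9, 2, 2]})
# cumsum: 9, 11, 13 -> once past max_sum, m1 stays True for the rest of the key
m1 = demo.groupby('keys')['values'].cumsum().gt(max_sum)
m2 = demo['keys'].ne(demo['keys'].shift())
labels = (m1 | m2).cumsum().tolist()
print(labels)  # [1, 2, 3], but greedy grouping would give [1, 2, 2]
```

The greedy rule would put the last two values together (2 + 2 = 4 <= 10), so this vectorized version over-splits in that situation.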
My logic: first get the cumulative sum within each key and bucket it by max_sum; then take the previous key's last (i.e. maximum) group number, accumulate it, and add it as an offset to the next key's groups.
s = (df.groupby('keys')['values'].cumsum() // max_sum + 1)
s + s.groupby(df['keys']).last().shift().fillna(0).cumsum().reindex(df['keys']).values
Out[24]:
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 4.0
6 5.0
7 6.0
8 7.0
9 7.0
10 7.0
11 7.0
12 7.0
Name: values, dtype: float64
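The result comes back as float because shift() introduces a NaN before the fillna. If integer labels are preferred, the same computation can be cast at the end (a minimal sketch on the question's data):

```python
import pandas as pd

max_sum = 10
df = pd.DataFrame({'keys': list('aaaabbbbccccc'),
                   'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})

# per-key bucket number, then the cumulative offset of each previous key's last group
s = df.groupby('keys')['values'].cumsum() // max_sum + 1
offset = s.groupby(df['keys']).last().shift().fillna(0).cumsum()
group = (s + offset.reindex(df['keys']).values).astype(int)
print(group.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7]
```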
Another way
pd.factorize(list(zip(df['keys'], df.groupby('keys')['values'].cumsum() // max_sum)))[0] + 1
Out[51]: array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7], dtype=int64)
Method 3, using the expanded data from Pir:
s = df.groupby('keys')['values'].rolling(2, min_periods=1).sum().gt(max_sum)
s.loc[s.groupby(level=0).head(1).index[1:]] = True
s.cumsum() + 1
Out[79]:
keys
a 0 1
1 1
2 2
3 3
b 4 4
5 4
6 5
7 6
c 8 7
9 7
10 7
11 7
12 7
d 13 8
14 8
15 9
16 10
Name: values, dtype: int32
This can't be done in a fully vectorized way, at least not as far as I can tell.
Consider the expanded example
df = pd.DataFrame({
    'keys': [*'aaaabbbbcccccdddddddd'],
    'values': [*map(int, '156824771111544922252')]
})
def gen_groups(tups, max_sum=10):
    label = 0
    sums = {}
    for key, val in tups:
        if key not in sums:
            label += 1
            sums[key] = 0
        sums[key] += val
        if sums[key] > max_sum:
            # This resets the summation
            # to the first thing that exceeded the max
            sums[key] = val
            label += 1
        yield label
df.assign(group=[*gen_groups(zip(df['keys'], df['values']))])
OUTPUT
keys values group
0 a 1 1
1 a 5 1
2 a 6 2
3 a 8 3
4 b 2 4
5 b 4 4
6 b 7 5
7 b 7 6
8 c 1 7
9 c 1 7
10 c 1 7
11 c 1 7
12 c 5 7
13 d 4 8 # First group for `key == d`
14 d 4 8 # Still same group because `4 + 4 <= 10`
15 d 9 9 # New group because `4 + 4 + 9 > 10`
16 d 2 10 # New group because `9 + 2 > 10`
17 d 2 10 # Same group because `2 + 2 <= 10`
18 d 2 10 # Same group because `2 + 2 + 2 <= 10`
19 d 5 11 # New group because `2 + 2 + 2 + 5 > 10`
20 d 2 11 # Same group because `5 + 2 <= 10`
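As a sanity check, the generator also handles the 4, 4, 9, 2 corner case from the comments (the key 'x' is my own placeholder):

```python
def gen_groups(tups, max_sum=10):
    label = 0
    sums = {}
    for key, val in tups:
        if key not in sums:
            label += 1
            sums[key] = 0
        sums[key] += val
        if sums[key] > max_sum:
            # Reset the running sum to the value that caused the overflow
            sums[key] = val
            label += 1
        yield label

labels = [*gen_groups(zip('xxxx', [4, 4, 9, 2]))]
print(labels)  # [1, 1, 2, 3]
```

Resetting the per-key sum to the overflowing value (rather than to zero) is exactly what keeps 9 + 2 > 10 visible after the reset.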