How to assign groups based on a maximum sum?

I have a dataframe like this:

df = pd.DataFrame({'keys': list('aaaabbbbccccc'), 'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})

   keys  values
0     a       1
1     a       5
2     a       6
3     a       8
4     b       2
5     b       4
6     b       7
7     b       7
8     c       1
9     c       1
10    c       1
11    c       1
12    c       5

Further, I have a variable max_sum = 10.

I want to assign a group to each row (i) based on the value in keys and (ii) the max_sum which should not be exceeded per group.

My expected outcome looks like this:

   keys  values  group
0     a       1      1
1     a       5      1
2     a       6      2
3     a       8      3
4     b       2      4
5     b       4      4
6     b       7      5
7     b       7      6
8     c       1      7
9     c       1      7
10    c       1      7
11    c       1      7
12    c       5      7

So, the first two values in the a group (1 and 5) sum to 6, which is less than 10, so they share a group. Adding 6 as well would exceed max_sum, so that value starts group 2. We cannot add 8 to group 2 either, as max_sum would again be exceeded, so we open group 3. The same logic then applies to the b and c keys.
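As a minimal sketch (not any answer's code, just the rule above spelled out for the a rows with max_sum = 10):

```python
max_sum = 10
values = [1, 5, 6, 8]  # the 'a' rows

groups, running, grp = [], 0, 1
for v in values:
    running += v
    if running > max_sum:
        # the new value does not fit: restart the running sum and open a new group
        running = v
        grp += 1
    groups.append(grp)

print(groups)  # [1, 1, 2, 3]
```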

One can do

df['cumsum'] = df.groupby('keys')['values'].cumsum()

   keys  values  cumsum
0     a       1       1
1     a       5       6
2     a       6      12
3     a       8      20
4     b       2       2
5     b       4       6
6     b       7      13
7     b       7      20
8     c       1       1
9     c       1       2
10    c       1       3
11    c       1       4
12    c       5       9

but I don't know how to get the group info from this.

Asked by Cleb on Jun 03 '19.

4 Answers

We want to partition rows based on their cumulative sum, so we use cumsum, take the modulus with respect to max_sum, and then take differences: a negative difference marks the start of a new group. This has to happen per key, so the entire operation is done inside a GroupBy.apply call.

(df.groupby('keys')['values']
   .apply(lambda x: x.cumsum().mod(max_sum).diff())
   .fillna(-1)
   .lt(0)
   .cumsum())                 

0     1
1     1
2     2
3     3
4     4
5     4
6     5
7     6
8     7
9     7
10    7
11    7
12    7
Name: values, dtype: int64

In a comment below, I wrote:

@Cleb Looks like my answer here is wrong. For 4, 4, 9, 2, the output should be 1, 1, 2, 3 but my code will assign 1, 1, 2, 2 because cumsum discounts the values.
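The failure is easy to reproduce on that example (assuming max_sum = 10):

```python
import pandas as pd

max_sum = 10
s = pd.Series([4, 4, 9, 2])

# cumsum: 4, 8, 17, 19 -> mod 10: 4, 8, 7, 9 -> diff: NaN, 4, -1, 2
out = s.cumsum().mod(max_sum).diff().fillna(-1).lt(0).cumsum()
print(out.tolist())  # [1, 1, 2, 2] -- but 9 + 2 > 10, so the last row should start group 3
```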

So, here's my solution to address this corner case. Define a function that assigns groups:

grp = {'grp': 0}  # better than `global`, at least
def func(V):
    cumsum = 0
    grp['grp'] += 1
    grps = []
    for v in V.tolist():
        cumsum += v
        if cumsum > max_sum:
            cumsum = v
            grp['grp'] += 1
        grps.append(grp['grp'])

    return pd.Series(grps)

Now, call apply:

df.groupby('keys')['values'].apply(func).values
# array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7])
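As a quick check (redefining func from above so the snippet is self-contained), the corner case from the comment now comes out right:

```python
import pandas as pd

max_sum = 10
grp = {'grp': 0}

def func(V):  # same function as defined above
    cumsum = 0
    grp['grp'] += 1
    grps = []
    for v in V.tolist():
        cumsum += v
        if cumsum > max_sum:
            cumsum = v
            grp['grp'] += 1
        grps.append(grp['grp'])
    return pd.Series(grps)

df2 = pd.DataFrame({'keys': list('aaaa'), 'values': [4, 4, 9, 2]})
out = df2.groupby('keys')['values'].apply(func).values
print(out)  # [1 1 2 3] -- 9 + 2 > 10 now correctly starts a new group
```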
Answered by cs95 on Sep 17 '22.


We can create two masks and, based on them, build a True / False array.

  • m1: True for rows where the per-key cumulative sum is greater than max_sum, else False.
  • m2: True for rows where keys differs from the previous row (i.e. a new key starts).

With np.where we basically have the following pseudocode:

when m1 or m2 is True, return True, else False

Since True and False are booleans, they translate to 1 / 0 when summed:

True + True

2

That's the reason for the cumsum in the last line.

Code:

max_sum = 10

m1 = df.groupby('keys')['values'].cumsum().gt(max_sum)  # all values which are greater than max_sum 
m2 = df['keys'].ne(df['keys'].shift())                  # all rows where keys change

df['group'] = np.where(m1 | m2, True, False).cumsum()


   keys  values  group
0     a       1      1
1     a       5      1
2     a       6      2
3     a       8      3
4     b       2      4
5     b       4      4
6     b       7      5
7     b       7      6
8     c       1      7
9     c       1      7
10    c       1      7
11    c       1      7
12    c       5      7
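For illustration, here are the intermediate masks on the question's data; note that np.where(m1 | m2, True, False) is simply m1 | m2, so summing the boolean mask directly gives the same groups:

```python
import pandas as pd

max_sum = 10
df = pd.DataFrame({'keys': list('aaaabbbbccccc'),
                   'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})

m1 = df.groupby('keys')['values'].cumsum().gt(max_sum)  # running total exceeded max_sum
m2 = df['keys'].ne(df['keys'].shift())                  # first row of each key

group = (m1 | m2).cumsum()
print(group.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7]
```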
Answered by Erfan on Sep 18 '22.


My logic: first get the cumulative sum within each group and floor-divide by 10 to get a group number per key; then take the last group number of each previous key, accumulate it, and add it as an offset so the numbering continues across keys.

s=(df.groupby('keys')['values'].cumsum()//10+1)
s+s.groupby(df['keys']).last().shift().fillna(0).cumsum().reindex(df['keys']).values

Out[24]: 
0     1.0
1     1.0
2     2.0
3     3.0
4     4.0
5     4.0
6     5.0
7     6.0
8     7.0
9     7.0
10    7.0
11    7.0
12    7.0
Name: values, dtype: float64

Another way

pd.factorize(list(zip(df['keys'],df.groupby('keys')['values'].cumsum()//10)))[0]+1
Out[51]: array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7], dtype=int64)
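Roughly, cumsum() // 10 bins each key's running total into blocks of ten, and factorize then enumerates the distinct (key, bin) pairs in order of appearance; a small sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'keys': list('aaaabbbbccccc'),
                   'values': [1, 5, 6, 8, 2, 4, 7, 7, 1, 1, 1, 1, 5]})

# per-key running total, binned into blocks of ten
bins = df.groupby('keys')['values'].cumsum() // 10
print(bins.tolist())   # [0, 0, 1, 2, 0, 0, 1, 2, 0, 0, 0, 0, 0]

# enumerate distinct (key, bin) pairs in order of appearance
codes = pd.factorize(list(zip(df['keys'], bins)))[0] + 1
print(codes.tolist())  # [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7]
```

Note that, like the plain cumsum approach in the first answer, this binning never resets at a group boundary, so an input such as 4, 4, 9, 2 yields 1, 1, 2, 2 rather than 1, 1, 2, 3.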

Method 3, using the extended data from piRSquared's answer:

s=df.groupby('keys')['values'].rolling(2,min_periods=1).sum().gt(10)
s.loc[s.groupby(level=0).head(1).index[1:]]=True
s.cumsum()+1
Out[79]: 
keys    
a     0      1
      1      1
      2      2
      3      3
b     4      4
      5      4
      6      5
      7      6
c     8      7
      9      7
      10     7
      11     7
      12     7
d     13     8
      14     8
      15     9
      16    10
Name: values, dtype: int32
Answered by BENY on Sep 20 '22.


This is not a vectorizable problem

At least not as far as I can tell

Setup

Consider the expanded example

df = pd.DataFrame({
    'keys': [*'aaaabbbbcccccdddddddd'],
    'values': [*map(int, '156824771111544922252')]
})

Using a generator

def gen_groups(tups, max_sum=10):
    label = 0
    sums = {}
    for key, val in tups:
        if key not in sums:
            label += 1
            sums[key] = 0
        sums[key] += val
        if sums[key] > max_sum:
            # This resets the summation
            # to the first thing that exceeded the max
            sums[key] = val
            label += 1
        yield label

df.assign(group=[*gen_groups(zip(df['keys'], df['values']))])

OUTPUT

   keys  values  group
0     a       1      1
1     a       5      1
2     a       6      2
3     a       8      3
4     b       2      4
5     b       4      4
6     b       7      5
7     b       7      6
8     c       1      7
9     c       1      7
10    c       1      7
11    c       1      7
12    c       5      7
13    d       4      8  # First group for `key == d` 
14    d       4      8  # Still same group because `4 + 4 <= 10`
15    d       9      9  # New group because `4 + 4 + 9 > 10`
16    d       2     10  # New group because `9 + 2 > 10`
17    d       2     10  # Same group because `2 + 2 <= 10`
18    d       2     10  # Same group because `2 + 2 + 2 <= 10`
19    d       5     11  # New Group because `2 + 2 + 2 + 5 > 10`
20    d       2     11  # Same Group because `5 + 2 <= 10`
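As a sanity check (repeating the generator so the snippet runs standalone), the 4, 4, 9, 2 corner case comes out right because the running sum resets to the value that overflowed:

```python
def gen_groups(tups, max_sum=10):  # same generator as above
    label = 0
    sums = {}
    for key, val in tups:
        if key not in sums:
            label += 1
            sums[key] = 0
        sums[key] += val
        if sums[key] > max_sum:
            # reset the summation to the value that exceeded the max
            sums[key] = val
            label += 1
        yield label

out = [*gen_groups(zip('aaaa', [4, 4, 9, 2]))]
print(out)  # [1, 1, 2, 3]
```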
Answered by piRSquared on Sep 19 '22.