Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python pandas groupby with cumsum and percentage

Given the following dataframe df:

      app       platform    uuid                              minutes
0     1         0  a696ccf9-22cb-428b-adee-95c9a97a4581       67
1     2         0  8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d        1
2     2         1  40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3     1         0  26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f        2
4     2         0  34271596-eebb-4423-b890-dc3761ed37ca        8
5     3         1  C57D0F52-B565-4322-85D2-C2798F7CA6FF       16
6     2         0      245501ec2e39cb782bab1fb02d7813b7        1
7     3         1  DE6E4714-5A3C-4C80-BD81-EAACB2364DF0       30
8     3         0  f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
9     2         0  9c08c860-7a6d-4810-a5c3-f3af2a3fcf66      470
10    3         1      19fdaedfd0dbdaf6a7a6b49619f11a19        3
11    3         1  AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B       58
12    2         0  4eb1024b-c293-42a4-95a2-31b20c3b524b       24
13    3         1  8E0B0BE3-8553-4F38-9837-6C907E01F84C        7
14    3         1  E8B2849C-F050-4DCD-B311-5D57015466AE      465
15    2         0  ec7fedb6-b118-424a-babe-b8ffad579685      266
16    1         0  7e302dcb-ceaf-406c-a9e5-66933d921064      184
17    2         0      f786528ded200c9f553dd3a5e9e9bb2d       10
18    3         1  1E291633-AF27-4DFB-8DA4-4A5B63F175CF       13
19    2         0  953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408`

I'll group it:

y=df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'],ascending=[1,1,0]).set_index(['app','platform','uuid'])

                                                   minutes
app platform uuid                                         
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184
             a696ccf9-22cb-428b-adee-95c9a97a4581       67
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f        2
2   0        953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408
             9c08c860-7a6d-4810-a5c3-f3af2a3fcf66      470
             ec7fedb6-b118-424a-babe-b8ffad579685      266
             4eb1024b-c293-42a4-95a2-31b20c3b524b       24
             f786528ded200c9f553dd3a5e9e9bb2d           10
             34271596-eebb-4423-b890-dc3761ed37ca        8
             245501ec2e39cb782bab1fb02d7813b7            1
             8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d        1
    1        40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3   0        f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
    1        E8B2849C-F050-4DCD-B311-5D57015466AE      465
             AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B       58
             DE6E4714-5A3C-4C80-BD81-EAACB2364DF0       30
             C57D0F52-B565-4322-85D2-C2798F7CA6FF       16
             1E291633-AF27-4DFB-8DA4-4A5B63F175CF       13
             8E0B0BE3-8553-4F38-9837-6C907E01F84C        7
             19fdaedfd0dbdaf6a7a6b49619f11a19            3

So that I got its minutes per uuid in decrescent order.

Now, I will sum the cumulative minutes per app/platform/uuid:

y.groupby(level=[0,1]).cumsum()


app platform uuid                                         
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184
             a696ccf9-22cb-428b-adee-95c9a97a4581      251
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f      253
2   0        953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408
             9c08c860-7a6d-4810-a5c3-f3af2a3fcf66     2878
             ec7fedb6-b118-424a-babe-b8ffad579685     3144
             4eb1024b-c293-42a4-95a2-31b20c3b524b     3168
             f786528ded200c9f553dd3a5e9e9bb2d         3178
             34271596-eebb-4423-b890-dc3761ed37ca     3186
             245501ec2e39cb782bab1fb02d7813b7         3187
             8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d     3188
    1        40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3   0        f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
    1        E8B2849C-F050-4DCD-B311-5D57015466AE      465
             AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B      523
             DE6E4714-5A3C-4C80-BD81-EAACB2364DF0      553
             C57D0F52-B565-4322-85D2-C2798F7CA6FF      569
             1E291633-AF27-4DFB-8DA4-4A5B63F175CF      582
             8E0B0BE3-8553-4F38-9837-6C907E01F84C      589
             19fdaedfd0dbdaf6a7a6b49619f11a19          592

My question is: how can I get the percent agains the total cumulative sum, per group, i.e, something like this:

app platform uuid                                         
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184    0.26
             a696ccf9-22cb-428b-adee-95c9a97a4581      251    0.36
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f      253    0.36
...
...
...
like image 604
jbastos Avatar asked Oct 20 '22 03:10

jbastos


1 Answers

It's not clear you came up with 0.26, 0.36 in your desired output - but assuming those are just dummy numbers, to get a running % of total for each group, you could do this:

y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])

Should give the right output.

In [398]: y['running_pct'].head()
Out[398]: 
app  platform  uuid                                
1    0         7e302dcb-ceaf-406c-a9e5-66933d921064    0.727273
               a696ccf9-22cb-428b-adee-95c9a97a4581    0.992095
               26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f    1.000000
2    0         953a525c-97e0-4c2f-90e0-dfebde3ec20d    0.755332
               9c08c860-7a6d-4810-a5c3-f3af2a3fcf66    0.902760
Name: running_pct, dtype: float64

EDIT:

Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1

y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')

And as @Jeff notes, in 0.15.0 this may be faster yet.

y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')
like image 127
chrisb Avatar answered Oct 27 '22 00:10

chrisb