Pandas: Collapse first n rows in each group by aggregation

Tags:

python

pandas

I have a dataframe grouped by 'id'. There are many groups, and each group has a variable number of rows. The first three rows of each group do not contain interesting data. I would like to "collapse" the first three rows of each group to form a single row, as follows:

'id' and 'type' will remain the same in the new 'collapsed' row.
'grp_idx' will be set to 0 in the collapsed row that aggregates the first three rows.
'col_1' will be the sum of the first three rows.
'col_2' will be the sum of the first three rows.
'flag' in the collapsed row will be 0 if the values are all 0 in the first 3 rows, and 1 if it is 1 in any of the first three rows. (A simple sum suffices for this logic, since the flag is set in at most one row per group.)

Here is an example of what the dataframe looks like:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [283, 283, 283, 283, 283, 283, 283, 756, 756, 756],
    'type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'X', 'X', 'X'],
    'grp_idx': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    'col_1': [2, 4, 6, 8, 10, 12, 14, 5, 10, 15],
    'col_2': [3, 6, 9, 12, 15, 18, 21, 1, 2, 3],
    'flag': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
})
print(df)

    id   type  grp_idx  col_1  col_2  flag
0  283    A        1      2      3     0
1  283    A        2      4      6     0
2  283    A        3      6      9     0
3  283    A        4      8     12     0
4  283    A        5     10     15     0
5  283    A        6     12     18     0
6  283    A        7     14     21     1
7  756    X        1      5      1     0
8  756    X        2     10      2     0
9  756    X        3     15      3     1

After processing, I expect the dataframe to look like:

 id  type  grp_idx  col_1  col_2  flag
283     A        0     12     18     0
283     A        4      8     12     0
283     A        5     10     15     0
283     A        6     12     18     0
283     A        7     14     21     1
756     X        0     30      6     1

I'm not sure how to proceed. I was trying to play around with

df.groupby('id').head(3).sum()

but this is not doing what I need. Any help, suggestions, or code snippets would be really appreciated.

asked Apr 06 '16 by Learner

2 Answers

I was trying to play around with

df.groupby('id').head(3).sum()

After you call groupby(), you need to aggregate() in order to combine in the way you want; head(3) alone just returns the first three rows of each group as one flat DataFrame, so the trailing .sum() sums across all groups at once instead of within each group. Try something like this:

# function to sum the first 3 rows
def head_sum(x):
    return x.head(3).sum()

# function to get max of first 3 rows
def head_max(x):
    return x.head(3).max()

# We can use a dictionary in `aggregate()` to call a
# specific function for each column in the groupby.
# 'id' is the grouping key, so reset_index() brings it back as a column.
column_funcs = {'col_1': head_sum,
                'col_2': head_sum,
                'flag': head_max,
                'type': 'first'}  # assumes 'id' and 'type' are always matched
collapsed = df.groupby('id').aggregate(column_funcs).reset_index()
collapsed['grp_idx'] = 0

# drop the first three rows of each group, then append the collapsed rows
rest = df[df.groupby('id').cumcount() >= 3]
new_df = pd.concat([collapsed, rest], ignore_index=True)
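
As a quick sanity check, you can put the collapsed row first in each group and restore the original column order before printing (a minimal sketch, assuming the snippet above):

new_df = new_df.sort_values(['id', 'grp_idx'])[list(df.columns)]
print(new_df)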

See the pandas GroupBy documentation for a lot more info on the split-apply-combine approach.

answered Oct 05 '22 by Zachary Cross


You can start by setting the grp_idx:

df["grp_idx"] = np.where(df.groupby("id").cumcount()<3, 0, df["grp_idx"])

Now id, type, and grp_idx create the grouping you want:

df.groupby(["id", "type", "grp_idx"]).sum().reset_index()

    id type  grp_idx  col_1  col_2  flag
0  283    A        0     12     18     0
1  283    A        4      8     12     0
2  283    A        5     10     15     0
3  283    A        6     12     18     0
4  283    A        7     14     21     1
5  756    X        0     30      6     1

I assumed the type cannot differ within the same id, as you didn't give any conditions for that column. I also assumed the df is sorted by id; if not, you can sort it first so that grp_idx comes out correct.
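
The same idea generalizes to any number of leading rows. Here is a minimal sketch of a reusable helper (the name collapse_first_n is mine, not part of the original answer):

def collapse_first_n(frame, n, group_col="id"):
    # Relabel the first n rows of each group with grp_idx 0,
    # then let groupby fold them into a single summed row
    out = frame.copy()
    out["grp_idx"] = np.where(out.groupby(group_col).cumcount() < n,
                              0, out["grp_idx"])
    return out.groupby([group_col, "type", "grp_idx"]).sum().reset_index()

collapsed = collapse_first_n(df, 3)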

answered Oct 05 '22 by ayhan