I have a dataframe which is grouped by id. There are many groups, and each group has a variable number of rows. The first three rows of every group do not contain interesting data. I would like to "collapse" the first three rows of each group into a single row, as follows:
'id' and 'type' will remain the same in the new "collapsed" row.
'grp_idx' will be set to 0 in the collapsed row.
'col_1' will be the sum over the first three rows.
'col_2' will be the sum over the first three rows.
'flag' in the collapsed row will be 0 if it is 0 in all of the first three rows, and 1 if it is 1 in any of them. (A simple sum suffices for this logic, since the flag is set in only one row of each group.)
Here is an example of what the dataframe looks like:
import pandas as pd
import numpy as np
# DataFrame.from_items was removed in pandas 1.0; a plain dict keeps the column order
df = pd.DataFrame({
    'id': [283, 283, 283, 283, 283, 283, 283, 756, 756, 756],
    'type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'X', 'X', 'X'],
    'grp_idx': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    'col_1': [2, 4, 6, 8, 10, 12, 14, 5, 10, 15],
    'col_2': [3, 6, 9, 12, 15, 18, 21, 1, 2, 3],
    'flag': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
})
print(df)
    id type  grp_idx  col_1  col_2  flag
0  283    A        1      2      3     0
1  283    A        2      4      6     0
2  283    A        3      6      9     0
3  283    A        4      8     12     0
4  283    A        5     10     15     0
5  283    A        6     12     18     0
6  283    A        7     14     21     1
7  756    X        1      5      1     0
8  756    X        2     10      2     0
9  756    X        3     15      3     1
After processing, I expect the dataframe to look like:
 id type  grp_idx  col_1  col_2  flag
283    A        0     12     18     0
283    A        4      8     12     0
283    A        5     10     15     0
283    A        6     12     18     0
283    A        7     14     21     1
756    X        0     30      6     1
I'm not sure how to proceed. I was trying to play around with
df.groupby('id').head(3).sum()
but this is not doing what I need. Any help, suggestions, or code snippets would be really appreciated.
Regarding "I was trying to play around with df.groupby('id').head(3).sum()":
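For context on why that attempt falls short: `groupby(...).head(3)` returns an ordinary DataFrame holding the first three rows of every group, so the trailing `.sum()` adds those rows up across all groups rather than per group. A minimal sketch using the sample data from the question (the variable names `first_three`, `overall` and `per_id` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [283, 283, 283, 283, 283, 283, 283, 756, 756, 756],
    'type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'X', 'X', 'X'],
    'grp_idx': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    'col_1': [2, 4, 6, 8, 10, 12, 14, 5, 10, 15],
    'col_2': [3, 6, 9, 12, 15, 18, 21, 1, 2, 3],
    'flag': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
})

# head(3) on a groupby is a filter, not an aggregation: the result is a
# regular DataFrame with up to three rows per id and no grouping attached.
first_three = df.groupby('id').head(3)

# Summing that frame therefore collapses both ids into a single total.
overall = first_three[['col_1', 'col_2', 'flag']].sum()

# Grouping the filtered rows again gives one total per id.
per_id = first_three.groupby('id')[['col_1', 'col_2', 'flag']].sum()
```

Here `overall` holds one number per column for the whole frame, while `per_id` has one row for each id, which is closer to what the question is after.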
After you call groupby(), you need to aggregate() in order to combine in the way you want. Try something like this:
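# function to sum the first 3 rows
def head_sum(x):
    return x.head(3).sum()
# function to get max of first 3 rows
def head_max(x):
    return x.head(3).max()
# We can use a dictionary in `aggregate()` to call a
# specific function for each column in the groupby
column_funcs = {'col_1': head_sum,
                'col_2': head_sum,
                'flag': head_max,
                'type': 'first'}  # assumes 'id' and 'type' are always matched
# 'id' is the group key, so reset_index() turns it back into a column
collapsed = df.groupby('id').aggregate(column_funcs).reset_index()
collapsed['grp_idx'] = 0
# keep only the rows after the first three of each group
rest = df[df.groupby('id').cumcount() >= 3]
# combine, then restore the original row and column order
new_df = (pd.concat([collapsed, rest])
          .sort_values(['id', 'grp_idx'])
          .reset_index(drop=True)[list(df.columns)])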
See the pandas user guide on group by (split-apply-combine) for a lot more info on this approach.
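As a variation on the same split-apply-combine idea, the first three rows can be filtered out up front so that only built-in aggregations are needed. This is a sketch rather than part of the answer above; it assumes pandas 0.25 or later (for named aggregation), that `type` is constant within each `id`, and that rows are already in order inside each group. The name `pos` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [283, 283, 283, 283, 283, 283, 283, 756, 756, 756],
    'type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'X', 'X', 'X'],
    'grp_idx': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    'col_1': [2, 4, 6, 8, 10, 12, 14, 5, 10, 15],
    'col_2': [3, 6, 9, 12, 15, 18, 21, 1, 2, 3],
    'flag': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
})

# Position of each row inside its id group: 0, 1, 2, ...
pos = df.groupby('id').cumcount()

# Aggregate only the first three rows of each group.
collapsed = (
    df[pos < 3]
    .groupby('id', as_index=False)
    .agg(type=('type', 'first'),
         col_1=('col_1', 'sum'),
         col_2=('col_2', 'sum'),
         flag=('flag', 'max'))
)
collapsed['grp_idx'] = 0

# Combine the collapsed rows with the untouched remainder and restore order.
new_df = (
    pd.concat([collapsed, df[pos >= 3]])
    .sort_values(['id', 'grp_idx'])
    .reset_index(drop=True)[list(df.columns)]
)
```

Filtering before grouping avoids calling a Python function once per group and column, which may matter when there are many groups.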
You can start by setting the grp_idx:
df["grp_idx"] = np.where(df.groupby("id").cumcount()<3, 0, df["grp_idx"])
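To make the mask in that line concrete, here is a small sketch of what `cumcount()` produces. The shortened sample frame and the names `pos` and `new_idx` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Shortened, made-up sample: one group of five rows and one of three.
df = pd.DataFrame({
    'id': [283, 283, 283, 283, 283, 756, 756, 756],
    'grp_idx': [1, 2, 3, 4, 5, 1, 2, 3],
})

# cumcount() numbers the rows of each group 0, 1, 2, ... in order of appearance.
pos = df.groupby('id').cumcount()

# Rows at positions 0-2 get grp_idx 0; the rest keep their original value.
new_idx = np.where(pos < 3, 0, df['grp_idx'])
```

Because the first three rows of every group now share the same `grp_idx`, they fall into one bucket when grouping on it.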
Now id and grp_idx create the grouping you want:
df.groupby(["id", "type", "grp_idx"]).sum().reset_index()
    id type  grp_idx  col_1  col_2  flag
0  283    A        0     12     18     0
1  283    A        4      8     12     0
2  283    A        5     10     15     0
3  283    A        6     12     18     0
4  283    A        7     14     21     1
5  756    X        0     30      6     1
I assumed the type cannot differ within the same id, since you didn't give any conditions for that column. I also assumed the df is sorted by id. If not, you can sort it first so that grp_idx is correct.
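If the rows might not be sorted, the two steps can be wrapped with an explicit sort first. The helper below is a hypothetical sketch (the name `collapse_head` and the parameter `n` are mine, not from the question); it assumes `grp_idx` reflects the intended row order within each `id` and, as above, that `type` is constant per `id`.

```python
import numpy as np
import pandas as pd


def collapse_head(df, n=3):
    """Collapse the first n rows of each id into one row with grp_idx 0."""
    # sort_values returns a new frame, so the caller's data is not modified.
    out = df.sort_values(['id', 'grp_idx'])
    pos = out.groupby('id').cumcount()
    out['grp_idx'] = np.where(pos < n, 0, out['grp_idx'])
    # Summing works for 'flag' because it is set in at most one row per group.
    return out.groupby(['id', 'type', 'grp_idx'], as_index=False).sum()


# Sample data from the question, shuffled to show that the sort matters.
df = pd.DataFrame({
    'id': [283, 283, 283, 283, 283, 283, 283, 756, 756, 756],
    'type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'X', 'X', 'X'],
    'grp_idx': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    'col_1': [2, 4, 6, 8, 10, 12, 14, 5, 10, 15],
    'col_2': [3, 6, 9, 12, 15, 18, 21, 1, 2, 3],
    'flag': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
}).sample(frac=1, random_state=0)

result = collapse_head(df)
```

Working on a sorted copy also leaves the original `grp_idx` column intact, whereas the in-place assignment above overwrites it.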