Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby()
...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': pd.np.mean})
Then do some aggregate/summarize operation (in my example, my function name_join
aggregates the names):
def name_join(list_names, concat='-'):
return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age
column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
group_df
, separate the namescreate empty dataframe. from every line in group_df , separate the names. return a dataframe with as many rows as there are names in the starting row. append the output to the empty dataframe.
Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns.
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
To split cell into multiple rows in a Python Pandas dataframe, we can use the apply method. to call apply with a lambda function that calls str. split to split the x string value. And then we call explode to fill new rows with the split values.
The rough equivalent is .reset_index()
, but it may not be helpful to think of it as the "opposite" of groupby()
.
You are splitting a string in to pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index()
at the end to get your desired result.
It turns out that pd.groupby()
returns an object with the original data stored in obj
. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
def name_join(list_names, concat='-'):
return concat.join(list_names)
print('create dataframe\n')
df = pandas.DataFrame({'name':['john', 'jason', 'jane', 'jack', 'james'], 'age':[1,36,32,26,30], 'family':[1,1,1,2,2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With