I have a DataFrame with a multi-index as a column. I want to group by level 1 and apply a function that generates a new column. I want this calculated column to be added to each group, and thus my dataframe to have new columns for each group.
I made a little dummy script and function to replicate what I want to do.
import pandas as pd
import numpy as np
columns = [('A','julian'),('A','geoffrey'),
('B','julian'),('B','geoffrey'),
('C','julian'),('C','geoffrey')]
columns = pd.MultiIndex.from_tuples(columns)
dataframe = pd.DataFrame(data=np.random.rand(10,6),columns=columns)
def addColumn(inputDF):
group = inputDF.columns[0][1]
inputDF['sum', group] = inputDF.sum(axis=1)
return inputDF
newColumnsDataframe = dataframe.groupby(level=1, axis=1).apply(addColumn)
The original dataframe looks as follows:
A B C
julian geoffrey julian geoffrey julian geoffrey
0 0.204082 0.073676 0.795725 0.279702 0.258185 0.258112
1 0.263235 0.096733 0.507324 0.541198 0.525919 0.757652
2 0.196243 0.028613 0.653408 0.364365 0.174911 0.924733
3 0.528785 0.831569 0.654160 0.738029 0.940831 0.294473
4 0.853517 0.263250 0.803087 0.855270 0.701937 0.264698
5 0.239797 0.069519 0.943544 0.374411 0.189361 0.846647
6 0.980734 0.290414 0.850097 0.873785 0.903645 0.118713
7 0.591942 0.088387 0.566298 0.062140 0.568482 0.872064
8 0.818167 0.061483 0.282050 0.008404 0.449198 0.658370
9 0.217424 0.427602 0.471933 0.171458 0.390549 0.234426
The resulting dataframe should look as follows (I constructed the sum DataFrame separately and concatenated the two dataframes to achieve this results):
A B C sum A B C \
geoffrey geoffrey geoffrey geoffrey julian julian julian
0 0.073676 0.279702 0.258112 0.611491 0.204082 0.795725 0.258185
1 0.096733 0.541198 0.757652 1.395584 0.263235 0.507324 0.525919
2 0.028613 0.364365 0.924733 1.317710 0.196243 0.653408 0.174911
3 0.831569 0.738029 0.294473 1.864071 0.528785 0.654160 0.940831
4 0.263250 0.855270 0.264698 1.383219 0.853517 0.803087 0.701937
5 0.069519 0.374411 0.846647 1.290578 0.239797 0.943544 0.189361
6 0.290414 0.873785 0.118713 1.282912 0.980734 0.850097 0.903645
7 0.088387 0.062140 0.872064 1.022590 0.591942 0.566298 0.568482
8 0.061483 0.008404 0.658370 0.728257 0.818167 0.282050 0.449198
9 0.427602 0.171458 0.234426 0.833486 0.217424 0.471933 0.390549
sum
julian
0 1.257992
1 1.296478
2 1.024561
3 2.123776
4 2.358542
5 1.372703
6 2.734476
7 1.726721
8 1.549415
9 1.079906
The approach in the script above is based on what makes sense to me and what others have written online about doing this sort of thing. However, newColumnsDataframe still only has 6 columns, not 8 (one added for each name).
I did notice that when I group by level=0 (so by A, B, or C) and use transform(but NOT when I use apply on this level), newColumnsDataframe does have 9 columns, one sum column added for each group. See the code below:
import pandas as pd
import numpy as np
columns = [('A','julian'),('A','geoffrey'),
('B','julian'),('B','geoffrey'),
('C','julian'),('C','geoffrey')]
columns = pd.MultiIndex.from_tuples(columns)
dataframe = pd.DataFrame(data=np.random.rand(10,6),columns=columns)
def addColumn(inputDF):
group = inputDF.columns[0][1]
inputDF[group, 'sum'] = inputDF.sum(axis=1)
return inputDF
newColumnsDataframe = dataframe.groupby(level=0, axis=1).transform(addColumn)
It was always my understanding that transform worked on each column within the group, whereas apply operated on the group as a whole dataframe. This seems to contradict that. I also noticed, that when I group by level=1 and use transform in stead of apply, it throws the following error:
ValueError: Length mismatch: Expected axis has 10 elements, new values have 6 elements
I am very confused about what is going on. Does anyone know why this DOES work when I use transform and group on level=0. Why there is an error when I do the same but group on level=1. And why grouping on EITHER level and APPLYING the function does not add columns to my final dataframe? Thanks in advance!
(PS: This is not the actual DataFrame or function I am using to add a column, just an easier illustration)
Kind of messy, but a one-liner:
(df.join(pd.concat({'sum': df.groupby(level=1, axis=1).sum()}, axis=1))
.sortlevel(level=1, axis=1))
Produces this for me:
A B C sum A B C \
geoffrey geoffrey geoffrey geoffrey julian julian julian
0 0.073676 0.279702 0.258112 0.611490 0.204082 0.795725 0.258185
1 0.096733 0.541198 0.757652 1.395583 0.263235 0.507324 0.525919
2 0.028613 0.364365 0.924733 1.317711 0.196243 0.653408 0.174911
3 0.831569 0.738029 0.294473 1.864071 0.528785 0.654160 0.940831
4 0.263250 0.855270 0.264698 1.383218 0.853517 0.803087 0.701937
5 0.069519 0.374411 0.846647 1.290577 0.239797 0.943544 0.189361
6 0.290414 0.873785 0.118713 1.282912 0.980734 0.850097 0.903645
7 0.088387 0.062140 0.872064 1.022591 0.591942 0.566298 0.568482
8 0.061483 0.008404 0.658370 0.728257 0.818167 0.282050 0.449198
9 0.427602 0.171458 0.234426 0.833486 0.217424 0.471933 0.390549
sum
julian
0 1.257992
1 1.296478
2 1.024562
3 2.123776
4 2.358541
5 1.372702
6 2.734476
7 1.726722
8 1.549415
9 1.079906
I just said "here's my df
, lets first group by the names of the people and sum it, then join these two summed columns back to the original df
, then use sortlevel
to sort by level=1
and axis=1
."
So, the only reason 'sum' is appearing after C
column is only because the letter s
comes after C
. If you had a column named x
, this wouldn't work. Not sure if that matters though.
Here's the df
I used for recreation-purposes:
df = pd.DataFrame({
('C', 'julian'): [0.258185, 0.52591899999999991, 0.17491099999999998, 0.94083099999999997, 0.70193700000000003, 0.189361, 0.90364500000000003, 0.56848199999999993, 0.44919799999999993, 0.39054899999999998],
('B', 'geoffrey'): [0.27970200000000001, 0.54119799999999996, 0.36436499999999999, 0.73802900000000005, 0.85527000000000009, 0.37441099999999999, 0.87378500000000003, 0.062140000000000001, 0.008404, 0.171458],
('A', 'julian'): [0.20408199999999999, 0.263235, 0.196243, 0.52878500000000006, 0.85351699999999997, 0.23979699999999998, 0.98073399999999999, 0.59194199999999997, 0.81816699999999998, 0.21742399999999998],
('B', 'julian'): [0.79572500000000002, 0.507324, 0.65340799999999999, 0.65416000000000007, 0.803087, 0.94354400000000005, 0.85009699999999988, 0.56629799999999997, 0.28205000000000002, 0.47193299999999999],
('A', 'geoffrey'): [0.073676000000000005, 0.096733, 0.028613, 0.831569, 0.26324999999999998, 0.069519000000000011, 0.29041400000000001, 0.088387000000000007, 0.061483000000000003, 0.42760200000000004],
('C', 'geoffrey'): [0.25811200000000001, 0.75765199999999999, 0.92473300000000003, 0.29447299999999998, 0.26469799999999999, 0.84664699999999993, 0.11871300000000001, 0.87206399999999995, 0.65837000000000001, 0.23442600000000002]},
columns=pd.MultiIndex.from_tuples([('A','julian'),('A','geoffrey'), ('B','julian'),('B','geoffrey'), ('C','julian'),('C','geoffrey')]))
Edited:
Here's another approach:
sum_columns = [('sum', name) for name in df.columns.levels[1].tolist()]
df[sum_columns] = df.groupby(axis=1, level=1).sum()
df = df.sortlevel(level=1, axis=1)
sum_columns
- looks like this [('sum', 'geoffrey'), ('sum', 'julian')]
.
df[sum_columns]
creates a new 'sum' column for each name on level 1.
If it's desired to have the sum column next to the name, use sortlevel
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With