Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using groupby and apply to add column to each group

I have a DataFrame with a multi-index as a column. I want to group by level 1 and apply a function that generates a new column. I want this calculated column to be added to each group, and thus my dataframe to have new columns for each group.

I made a little dummy script and function to replicate what I want to do.

import pandas as pd
import numpy as np

columns = [('A','julian'),('A','geoffrey'),
       ('B','julian'),('B','geoffrey'),
       ('C','julian'),('C','geoffrey')]

columns = pd.MultiIndex.from_tuples(columns)

dataframe = pd.DataFrame(data=np.random.rand(10,6),columns=columns)

def addColumn(inputDF):
   group = inputDF.columns[0][1]
   inputDF['sum', group] = inputDF.sum(axis=1)
   return inputDF

newColumnsDataframe = dataframe.groupby(level=1, axis=1).apply(addColumn) 

The original dataframe looks as follows:

      A                   B                   C          
    julian  geoffrey    julian  geoffrey    julian  geoffrey
 0  0.204082  0.073676  0.795725  0.279702  0.258185  0.258112
 1  0.263235  0.096733  0.507324  0.541198  0.525919  0.757652
 2  0.196243  0.028613  0.653408  0.364365  0.174911  0.924733
 3  0.528785  0.831569  0.654160  0.738029  0.940831  0.294473
 4  0.853517  0.263250  0.803087  0.855270  0.701937  0.264698
 5  0.239797  0.069519  0.943544  0.374411  0.189361  0.846647
 6  0.980734  0.290414  0.850097  0.873785  0.903645  0.118713
 7  0.591942  0.088387  0.566298  0.062140  0.568482  0.872064
 8  0.818167  0.061483  0.282050  0.008404  0.449198  0.658370
 9  0.217424  0.427602  0.471933  0.171458  0.390549  0.234426

The resulting dataframe should look as follows (I constructed the sum DataFrame separately and concatenated the two dataframes to achieve this results):

      A         B         C       sum         A         B         C  \
   geoffrey  geoffrey  geoffrey  geoffrey    julian    julian    julian   
0  0.073676  0.279702  0.258112  0.611491  0.204082  0.795725  0.258185   
1  0.096733  0.541198  0.757652  1.395584  0.263235  0.507324  0.525919   
2  0.028613  0.364365  0.924733  1.317710  0.196243  0.653408  0.174911   
3  0.831569  0.738029  0.294473  1.864071  0.528785  0.654160  0.940831   
4  0.263250  0.855270  0.264698  1.383219  0.853517  0.803087  0.701937   
5  0.069519  0.374411  0.846647  1.290578  0.239797  0.943544  0.189361   
6  0.290414  0.873785  0.118713  1.282912  0.980734  0.850097  0.903645   
7  0.088387  0.062140  0.872064  1.022590  0.591942  0.566298  0.568482   
8  0.061483  0.008404  0.658370  0.728257  0.818167  0.282050  0.449198   
9  0.427602  0.171458  0.234426  0.833486  0.217424  0.471933  0.390549   

    sum  
   julian  
0  1.257992  
1  1.296478  
2  1.024561  
3  2.123776  
4  2.358542  
5  1.372703  
6  2.734476  
7  1.726721  
8  1.549415  
9  1.079906  

The approach in the script above is based on what makes sense to me and what others have written online about doing this sort of thing. However, newColumnsDataframe still only has 6 columns, not 8 (one added for each name).

I did notice that when I group by level=0 (so by A, B, or C) and use transform(but NOT when I use apply on this level), newColumnsDataframe does have 9 columns, one sum column added for each group. See the code below:

import pandas as pd
import numpy as np

columns = [('A','julian'),('A','geoffrey'),
       ('B','julian'),('B','geoffrey'),
       ('C','julian'),('C','geoffrey')]

columns = pd.MultiIndex.from_tuples(columns)

dataframe = pd.DataFrame(data=np.random.rand(10,6),columns=columns)

def addColumn(inputDF):
    group = inputDF.columns[0][1]
    inputDF[group, 'sum'] = inputDF.sum(axis=1)
    return inputDF

newColumnsDataframe = dataframe.groupby(level=0, axis=1).transform(addColumn)

It was always my understanding that transform worked on each column within the group, whereas apply operated on the group as a whole dataframe. This seems to contradict that. I also noticed, that when I group by level=1 and use transform in stead of apply, it throws the following error:

ValueError: Length mismatch: Expected axis has 10 elements, new values have 6 elements

I am very confused about what is going on. Does anyone know why this DOES work when I use transform and group on level=0. Why there is an error when I do the same but group on level=1. And why grouping on EITHER level and APPLYING the function does not add columns to my final dataframe? Thanks in advance!

(PS: This is not the actual DataFrame or function I am using to add a column, just an easier illustration)

like image 987
jjvandermade Avatar asked Mar 15 '16 21:03

jjvandermade


1 Answers

Kind of messy, but a one-liner:

(df.join(pd.concat({'sum': df.groupby(level=1, axis=1).sum()}, axis=1))
   .sortlevel(level=1, axis=1))

Produces this for me:

          A         B         C       sum         A         B         C  \
   geoffrey  geoffrey  geoffrey  geoffrey    julian    julian    julian   
0  0.073676  0.279702  0.258112  0.611490  0.204082  0.795725  0.258185   
1  0.096733  0.541198  0.757652  1.395583  0.263235  0.507324  0.525919   
2  0.028613  0.364365  0.924733  1.317711  0.196243  0.653408  0.174911   
3  0.831569  0.738029  0.294473  1.864071  0.528785  0.654160  0.940831   
4  0.263250  0.855270  0.264698  1.383218  0.853517  0.803087  0.701937   
5  0.069519  0.374411  0.846647  1.290577  0.239797  0.943544  0.189361   
6  0.290414  0.873785  0.118713  1.282912  0.980734  0.850097  0.903645   
7  0.088387  0.062140  0.872064  1.022591  0.591942  0.566298  0.568482   
8  0.061483  0.008404  0.658370  0.728257  0.818167  0.282050  0.449198   
9  0.427602  0.171458  0.234426  0.833486  0.217424  0.471933  0.390549   

        sum  
     julian  
0  1.257992  
1  1.296478  
2  1.024562  
3  2.123776  
4  2.358541  
5  1.372702  
6  2.734476  
7  1.726722  
8  1.549415  
9  1.079906 

I just said "here's my df, lets first group by the names of the people and sum it, then join these two summed columns back to the original df, then use sortlevel to sort by level=1 and axis=1."

So, the only reason 'sum' is appearing after C column is only because the letter s comes after C. If you had a column named x, this wouldn't work. Not sure if that matters though.

Here's the df I used for recreation-purposes:

df = pd.DataFrame({
      ('C', 'julian'): [0.258185, 0.52591899999999991, 0.17491099999999998, 0.94083099999999997, 0.70193700000000003, 0.189361, 0.90364500000000003, 0.56848199999999993, 0.44919799999999993, 0.39054899999999998],
      ('B', 'geoffrey'): [0.27970200000000001, 0.54119799999999996, 0.36436499999999999, 0.73802900000000005, 0.85527000000000009, 0.37441099999999999, 0.87378500000000003, 0.062140000000000001, 0.008404, 0.171458], 
      ('A', 'julian'): [0.20408199999999999, 0.263235, 0.196243, 0.52878500000000006, 0.85351699999999997, 0.23979699999999998, 0.98073399999999999, 0.59194199999999997, 0.81816699999999998, 0.21742399999999998], 
      ('B', 'julian'): [0.79572500000000002, 0.507324, 0.65340799999999999, 0.65416000000000007, 0.803087, 0.94354400000000005, 0.85009699999999988, 0.56629799999999997, 0.28205000000000002, 0.47193299999999999], 
      ('A', 'geoffrey'): [0.073676000000000005, 0.096733, 0.028613, 0.831569, 0.26324999999999998, 0.069519000000000011, 0.29041400000000001, 0.088387000000000007, 0.061483000000000003, 0.42760200000000004], 
      ('C', 'geoffrey'): [0.25811200000000001, 0.75765199999999999, 0.92473300000000003, 0.29447299999999998, 0.26469799999999999, 0.84664699999999993, 0.11871300000000001, 0.87206399999999995, 0.65837000000000001, 0.23442600000000002]},
      columns=pd.MultiIndex.from_tuples([('A','julian'),('A','geoffrey'), ('B','julian'),('B','geoffrey'), ('C','julian'),('C','geoffrey')]))

Edited:

Here's another approach:

sum_columns = [('sum', name) for name in df.columns.levels[1].tolist()]
df[sum_columns] = df.groupby(axis=1, level=1).sum()
df = df.sortlevel(level=1, axis=1)

sum_columns - looks like this [('sum', 'geoffrey'), ('sum', 'julian')].

df[sum_columns] creates a new 'sum' column for each name on level 1.

If it's desired to have the sum column next to the name, use sortlevel.

like image 54
Jarad Avatar answered Sep 20 '22 02:09

Jarad