Broadcast groupby result as new column in original DataFrame

I am trying to create a new column in a Pandas dataframe based on two columns in a grouped dataframe.

Specifically, I am trying to replicate the output from this R code:

library(data.table)

df = data.table(a = 1:6, 
            b = 7:12,
            c = c('q', 'q', 'q', 'q', 'w', 'w')
            )


df[, ab_weighted := sum(a)/sum(b), by = "c"]
df[, c('c', 'a', 'b', 'ab_weighted')]

Output:

[image: data.table output with columns c, a, b, ab_weighted]

So far, I tried the following in Python:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b']))

Output:

[image: resulting Series with one value per group in c]

When I change apply in the code above to transform, I get an error: TypeError: an integer is required

Transform works fine if the function uses only a single column, though:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.groupby(['c'])[['a', 'b']].transform(lambda x: sum(x))

But obviously, this is not the same answer:

[image: DataFrame with the per-group sums of a and b repeated on every row]

Is there a way to get the result of my data.table code in Pandas without having to generate intermediate columns (because then I could use transform on the final column)?

Any help greatly appreciated:)

Christoph asked Dec 10 '22 04:12


2 Answers

You can fix your code by using map. R and pandas still differ, which means that not every R function has a direct replacement in pandas.

df.c.map(df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b'])))
Out[67]: 
0    0.294118
1    0.294118
2    0.294118
3    0.294118
4    0.478261
5    0.478261
Name: c, dtype: float64
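
To store the mapped values as the new column, something like the following sketch might work. It assumes the df from the question and computes the per-group ratio from the built-in grouped sums rather than a lambda; the names sums and ratio are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# One row per group: sum(a) and sum(b) for each value of c.
sums = df.groupby('c')[['a', 'b']].sum()

# Per-group ratio, indexed by the group key ('q', 'w').
ratio = sums['a'] / sums['b']

# Look up each row's group key in that Series to broadcast the ratio
# back onto the original rows.
df['ab_weighted'] = df['c'].map(ratio)

print(df)
```

The map step is what does the broadcasting: ratio has one entry per group, and each row picks up the entry that matches its value of c.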
BENY answered Dec 12 '22 17:12


You're one step away.

v = df.groupby('c')[['a', 'b']].transform('sum')
df['ab_weighted'] = v.a / v.b

df
   a   b  c  ab_weighted
0  1   7  q     0.294118
1  2   8  q     0.294118
2  3   9  q     0.294118
3  4  10  q     0.294118
4  5  11  w     0.478261
5  6  12  w     0.478261
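
If even the temporary frame v is unwanted, the same idea can probably be written as a single assignment. This is a sketch that assumes the df from the question; g is just an illustrative name for the reused groupby object.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# Reuse one groupby object for both columns.
g = df.groupby('c')

# transform('sum') returns a Series aligned with df's index, so the two
# results can be divided row by row without adding helper columns to df.
df['ab_weighted'] = g['a'].transform('sum') / g['b'].transform('sum')

print(df)
```

Because transform keeps the original index, the division lines up with the rows of df directly and no map or merge step is needed.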
cs95 answered Dec 12 '22 18:12