I am trying to create a new column in a Pandas dataframe based on two columns in a grouped dataframe.
Specifically, I am trying to replicate the output from this R code:
library(data.table)
df = data.table(a = 1:6,
b = 7:12,
c = c('q', 'q', 'q', 'q', 'w', 'w')
)
df[, ab_weighted := sum(a)/sum(b), by = "c"]
df[, c('c', 'a', 'b', 'ab_weighted')]
Output:
So far, I tried the following in Python:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[7,8,9,10,11,12],
'c':['q', 'q', 'q', 'q', 'w', 'w']
})
df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b']))
Output:
When I change apply in the code above to transform, I get an error:
TypeError: an integer is required
transform works fine if the function uses only a single column at a time, though:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[7,8,9,10,11,12],
'c':['q', 'q', 'q', 'q', 'w', 'w']
})
df.groupby(['c'])[['a', 'b']].transform(lambda x: sum(x))
But obviously, this is not the same answer:
Is there a way to get the result from my data.table code in Pandas without having to generate intermediate columns (because then I could use transform on the final column)?
Any help is greatly appreciated :)
Just fixing your code using map. R and pandas still differ, which means that not every R function has a direct replacement in pandas.
df.c.map(df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b'])))
Out[67]:
0 0.294118
1 0.294118
2 0.294118
3 0.294118
4 0.478261
5 0.478261
Name: c, dtype: float64
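To make the mapping step more explicit, here is a minimal self-contained sketch of the same idea, split into two steps. The names sums and ratio are only illustrative, and it uses the built-in groupby sum instead of apply, which should give the same values:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# One row per group, indexed by the group key 'c'.
sums = df.groupby('c')[['a', 'b']].sum()

# One ratio per group: sum(a) / sum(b).
ratio = sums['a'] / sums['b']

# Look up each row's group key in that Series to broadcast the value back.
df['ab_weighted'] = df['c'].map(ratio)
```

The map call works because ratio is indexed by the values of c, so each row simply picks up the value for its own group.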
You're one step away.
v = df.groupby('c')[['a', 'b']].transform('sum')
df['ab_weighted'] = v.a / v.b
df
a b c ab_weighted
0 1 7 q 0.294118
1 2 8 q 0.294118
2 3 9 q 0.294118
3 4 10 q 0.294118
4 5 11 w 0.478261
5 6 12 w 0.478261
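If you need this in more than one place, the transform approach can be wrapped in a small helper. This is only a sketch: add_group_ratio and its parameter names are hypothetical, not part of pandas.

```python
import pandas as pd

def add_group_ratio(frame, num, den, by, name):
    """Return a copy of frame with sum(num) / sum(den) per group stored in column `name`."""
    # transform('sum') returns group sums aligned to the original rows.
    sums = frame.groupby(by)[[num, den]].transform('sum')
    # assign returns a new DataFrame, so the input frame is left unchanged.
    return frame.assign(**{name: sums[num] / sums[den]})

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

out = add_group_ratio(df, 'a', 'b', 'c', 'ab_weighted')
```

Because transform keeps the original index, the division lines up row by row and no merge or map is needed.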