I have two value columns and two weight columns in a pandas DataFrame, and I want to add a third column holding the grouped, weighted average of the two value columns.
For example, given:
import numpy as np
import pandas as pd

df = pd.DataFrame({'category':['a','a','b','b'],
                   'var1':np.random.randint(0,100,4),
                   'var2':np.random.randint(0,100,4),
                   'weights1':np.random.random(4),
                   'weights2':np.random.random(4)})
df
category var1 var2 weights1 weights2
0 a 84 45 0.955234 0.729862
1 a 49 5 0.225470 0.159662
2 b 77 95 0.957212 0.991960
3 b 27 65 0.491877 0.195680
The result I want is:
df
category var1 var2 weights1 weights2 average
0 a 84 45 0.955234 0.729862 67.108023
1 a 49 5 0.225470 0.159662 30.759124
2 b 77 95 0.957212 0.991960 86.160443
3 b 27 65 0.491877 0.195680 37.814851
I've already accomplished this using just arithmetic operators like this:
df['average'] = df.groupby('category', group_keys=False) \
.apply(lambda g: (g.weights1 * g.var1 + g.weights2 * g.var2) / (g.weights1 + g.weights2))
But I want to generalize it using numpy.average, so that I could, for example, take the weighted average of three or more columns.
I'm trying something like this, but it doesn't seem to work:
df['average'] = df.groupby('category', group_keys=False) \
.apply(lambda g: np.average([g.var1, g.var2], axis=0, weights=[g.weights1, g.weights2]))
It raises:
TypeError: incompatible index of inserted column with frame index
Can anyone help me do this?
I don't think you even need groupby here, since each row's weighted average only uses that row's own values and weights. Notice that this matches the output of the apply + lambda approach.
Try this:
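col = df.drop(columns='category')
# Pair columns by their numeric suffix (var1 with weights1, var2 with weights2),
# multiply within each pair, then sum the products per row.
s = col.T.groupby(col.columns.str.findall(r'\d+').str[0]).prod().T.sum(axis=1)
s / df.filter(like='weight').sum(axis=1)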
Out[33]:
0 67.108014
1 30.759168
2 86.160444
3 37.814871
dtype: float64
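For the numpy.average generalization the question asks about, one option is to pass a weights array with the same shape as the values and average along axis=1. The sketch below is only illustrative: it hard-codes the example values shown in the question, and the `value_cols` and `weight_cols` lists are names introduced here, not part of the original post.

```python
import numpy as np
import pandas as pd

# Values copied from the example frame printed in the question.
df = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b'],
    'var1': [84, 49, 77, 27],
    'var2': [45, 5, 95, 65],
    'weights1': [0.955234, 0.225470, 0.957212, 0.491877],
    'weights2': [0.729862, 0.159662, 0.991960, 0.195680],
})

# Hypothetical helper lists: keep both in the same order, and append
# 'var3' / 'weights3' and so on to average over more columns.
value_cols = ['var1', 'var2']
weight_cols = ['weights1', 'weights2']

# np.average accepts weights with the same shape as the data, so
# axis=1 yields one weighted mean per row.
df['average'] = np.average(df[value_cols].to_numpy(),
                           axis=1,
                           weights=df[weight_cols].to_numpy())
print(df)
```

Because the result is a plain array with one entry per row, it can be assigned straight to the new column without any index alignment.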