Pandas version 0.25 supports "Named Aggregation" via function agg
and namedtuples
. You need to pass column, aggregator pairs as the doc describes. It also says:
If your aggregation functions require additional arguments, partially apply them with functools.partial().
I would like to apply this principle to get a weighted average (besides a simple count and average). My input table is
import pandas as pd
t = pd.DataFrame({'bucket':['a', 'a', 'b', 'b', 'b'], 'weight': [2, 3, 1, 4, 3],
'qty': [100, 500, 200, 800, 700]})
and my query fails:
import functools
import numpy as np
t.groupby('bucket').agg(
NR= ('bucket', 'count'),
AVG_QTY= ('qty', np.mean),
W_AVG_QTY= ('qty', functools.partial(np.average, weights='weight'))
)
with an error message:
TypeError: 1D weights expected when shapes of a and weights differ.
I assume the problem comes from fixing the parameter to be another column instead of a constant? How can I make this work without the workaround that uses apply
and a lambda expression that returns a Series
?
To find a weighted average, multiply each number by its weight, then add the results. If the weights don't add up to one, find the sum of all the variables multiplied by their weight, then divide by the sum of the weights.
agg is an alias for aggregate . Use the alias. A passed user-defined-function will be passed a Series for evaluation. The aggregation is for each column.
A weighted average requires 2 separate Series (i.e. a DataFrame). Because of this GroupBy.apply
is the correct aggregation method to use. Use pd.concat
to join the results.
pd.concat([t.groupby('bucket').agg(NR = ('bucket', 'count'),
AVG_QTY = ('qty', np.mean)),
(t.groupby('bucket').apply(lambda gp: np.average(gp.qty, weights=gp.weight))
.rename('W_AVG_QTY'))],
axis=1)
# NR AVG_QTY W_AVG_QTY
#bucket
#a 2 300.000000 340.0
#b 3 566.666667 687.5
This can be done with agg
, assuming your DataFrame has a unique Index, though I can't guarantee it will be very performant given all the slicing. We create our own function that accepts the Series of values and the entire DataFrame. The function then subsets the DataFrame
using the Series to obtain the weights for each group.
def my_w_avg(s, df, wcol):
return np.average(s, weights=df.loc[s.index, wcol])
t.groupby('bucket').agg(
NR= ('bucket', 'count'),
AVG_QTY= ('qty', np.mean),
W_AVG_QTY= ('qty', functools.partial(my_w_avg, df=t, wcol='weight'))
)
# NR AVG_QTY W_AVG_QTY
#bucket
#a 2 300.000000 340.0
#b 3 566.666667 687.5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With