
Calculating weighted average by GroupBy.agg and a named aggregation

Pandas version 0.25 supports "named aggregation" via the function agg and namedtuples. You need to pass column, aggregator pairs, as the docs describe. The docs also say:

If your aggregation functions require additional arguments, partially apply them with functools.partial().

I would like to apply this principle to get a weighted average (besides a simple count and average). My input table is

import pandas as pd

t = pd.DataFrame({'bucket':['a', 'a', 'b', 'b', 'b'], 'weight': [2, 3, 1, 4, 3], 
                  'qty': [100, 500, 200, 800, 700]})

and my query fails:

import functools
import numpy as np

t.groupby('bucket').agg(
        NR= ('bucket', 'count'),
        AVG_QTY= ('qty', np.mean),
        W_AVG_QTY= ('qty', functools.partial(np.average, weights='weight'))
   )

with an error message:

TypeError: 1D weights expected when shapes of a and weights differ.

I assume the problem is that the weights parameter is fixed to a column name (the string 'weight') rather than the actual weight values? How can I make this work without the workaround of apply with a lambda expression that returns a Series?

asked Dec 11 '19 by Ferenc Bodon



1 Answer

A weighted average requires two separate Series (i.e. a DataFrame), so GroupBy.apply is the appropriate aggregation method here. Use pd.concat to join the results:

pd.concat([t.groupby('bucket').agg(NR = ('bucket', 'count'),
                                   AVG_QTY = ('qty', np.mean)),
           (t.groupby('bucket').apply(lambda gp: np.average(gp.qty, weights=gp.weight))
             .rename('W_AVG_QTY'))], 
          axis=1)

#        NR     AVG_QTY  W_AVG_QTY
#bucket                           
#a        2  300.000000      340.0
#b        3  566.666667      687.5
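As a quick sanity check (a hand computation, not part of the original answer), the weighted average for bucket 'a' follows directly from the definition:

```python
# Bucket 'a' has weights 2, 3 and quantities 100, 500 in the question's data.
# Weighted average = sum(weight * qty) / sum(weight).
w_avg_a = (2 * 100 + 3 * 500) / (2 + 3)
print(w_avg_a)  # 340.0
```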

This can also be done with agg, assuming your DataFrame has a unique index, though I can't guarantee it will be very performant given all the slicing. We define our own function that accepts the Series of group values plus the entire DataFrame; the function then uses the Series' index to subset the DataFrame and obtain the matching weights for each group.

def my_w_avg(s, df, wcol):
    # s.index identifies the rows of this group; use it to pull the matching weights.
    return np.average(s, weights=df.loc[s.index, wcol])

t.groupby('bucket').agg(
        NR= ('bucket', 'count'),
        AVG_QTY= ('qty', np.mean),
        W_AVG_QTY= ('qty', functools.partial(my_w_avg, df=t, wcol='weight'))
   )

#        NR     AVG_QTY  W_AVG_QTY
#bucket                           
#a        2  300.000000      340.0
#b        3  566.666667      687.5
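If you want to avoid both apply and the index-based slicing, one common alternative (a sketch I'm adding, not part of the answer above, using the same toy frame t) is to precompute the weight * qty product, so the weighted average reduces to two plain column sums that agg handles natively:

```python
import pandas as pd

# Same toy data as in the question.
t = pd.DataFrame({'bucket': ['a', 'a', 'b', 'b', 'b'],
                  'weight': [2, 3, 1, 4, 3],
                  'qty': [100, 500, 200, 800, 700]})

# Precompute weight*qty; the weighted average is then sum(wq) / sum(weight),
# both ordinary per-group reductions.
g = (t.assign(wq=t['weight'] * t['qty'])
      .groupby('bucket')
      .agg(NR=('bucket', 'count'),
           AVG_QTY=('qty', 'mean'),
           WQ_SUM=('wq', 'sum'),
           W_SUM=('weight', 'sum')))
g['W_AVG_QTY'] = g.pop('WQ_SUM') / g.pop('W_SUM')
print(g)
#         NR     AVG_QTY  W_AVG_QTY
# bucket
# a        2  300.000000      340.0
# b        3  566.666667      687.5
```

This trades two extra intermediate columns for a fully vectorized computation with no Python-level per-group calls.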
answered Sep 30 '22 by ALollz