
groupby weighted average and sum in pandas dataframe

Tags:

python

pandas

r

I have a dataframe:

    Out[78]:
       contract month year  buys  adjusted_lots    price
    0         W     Z    5  Sell             -5   554.85
    1         C     Z    5  Sell             -3   424.50
    2         C     Z    5  Sell             -2   424.00
    3         C     Z    5  Sell             -2   423.75
    4         C     Z    5  Sell             -3   423.50
    5         C     Z    5  Sell             -2   425.50
    6         C     Z    5  Sell             -3   425.25
    7         C     Z    5  Sell             -2   426.00
    8         C     Z    5  Sell             -2   426.75
    9        CC     U    5   Buy              5  3328.00
    10       SB     V    5   Buy              5    11.65
    11       SB     V    5   Buy              5    11.64
    12       SB     V    5   Buy              2    11.60

I need the sum of adjusted_lots and the weighted average of price (weighted by adjusted_lots), grouped by all the other columns, i.e. grouped by (contract, month, year and buys).

I achieved a similar result in R with the following dplyr code, but I have been unable to do the same in pandas.

    > newdf = df %>%
        select(contract, month, year, buys, adjusted_lots, price) %>%
        group_by(contract, month, year, buys) %>%
        summarise(qty = sum(adjusted_lots),
                  avgpx = weighted.mean(x = price, w = adjusted_lots),
                  comdty = "Comdty")
    > newdf
    Source: local data frame [4 x 6]

      contract month year comdty qty     avgpx
    1        C     Z    5 Comdty -19  424.8289
    2       CC     U    5 Comdty   5 3328.0000
    3       SB     V    5 Comdty  12   11.6375
    4        W     Z    5 Comdty  -5  554.8500

Is the same possible with groupby, or with any other solution?

samsri asked Jul 20 '15 15:07





2 Answers

EDIT: updated the aggregation so it works with recent versions of pandas.

To apply several functions in one groupby aggregation, pass each output column as a keyword argument whose value is a tuple of the source column and the aggregation function to apply to it:

    import numpy as np

    # Define a lambda function to compute the weighted mean.
    # x is the per-group price Series; x.index selects the matching weights from df.
    wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

    # Define a dictionary with the functions to apply for a given column:
    # the following is deprecated since pandas 0.20:
    # f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
    # df.groupby(["contract", "month", "year", "buys"]).agg(f)

    # Group by and aggregate with named aggregation [1]:
    df.groupby(["contract", "month", "year", "buys"]).agg(
        adjusted_lots=("adjusted_lots", "sum"),
        price_weighted_mean=("price", wm),
    )

                               adjusted_lots  price_weighted_mean
    contract month year buys
    C        Z     5    Sell            -19           424.828947
    CC       U     5    Buy               5          3328.000000
    SB       V     5    Buy              12            11.637500
    W        Z     5    Sell             -5           554.850000
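For anyone who wants to run this end to end, here is a minimal self-contained sketch of the named-aggregation approach, assuming pandas 0.25 or later. The frame is rebuilt by hand from the data in the question. The helper name `weighted_mean` is illustrative, and the output column names `qty` and `avgpx` are borrowed from the R code in the question. Note that the helper reads the weights from the global `df`, so it only works for that frame.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "contract": ["W"] + ["C"] * 8 + ["CC"] + ["SB"] * 3,
    "month": ["Z"] * 9 + ["U"] + ["V"] * 3,
    "year": [5] * 13,
    "buys": ["Sell"] * 9 + ["Buy"] * 4,
    "adjusted_lots": [-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
    "price": [554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25,
              426.00, 426.75, 3328.00, 11.65, 11.64, 11.60],
})


def weighted_mean(prices):
    # Illustrative helper: `prices` is one group's "price" Series, and its
    # index is used to look up the matching weights in the global df.
    return np.average(prices, weights=df.loc[prices.index, "adjusted_lots"])


keys = ["contract", "month", "year", "buys"]
result = (
    df.groupby(keys)
      .agg(qty=("adjusted_lots", "sum"), avgpx=("price", weighted_mean))
      .reset_index()
)
print(result)
```

Calling `reset_index()` turns the group keys back into ordinary columns, which is closer to the shape of the dplyr result.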

You can see more here:

  • http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once

and in a similar question here:

  • Apply multiple functions to multiple groupby columns

Hope this helps

[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling

jrjc answered Sep 23 '22 18:09



Computing a weighted average with groupby(...).apply(...) can be very slow (about 100x slower than the following). See my answer (and others) on this thread.

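    import pandas as pd

    def weighted_average(df, data_col, weight_col, by_col):
        # Temporary columns: value * weight, and the weight zeroed where the value is missing
        df['_data_times_weight'] = df[data_col] * df[weight_col]
        df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
        g = df.groupby(by_col)
        # Weighted mean per group = sum(value * weight) / sum(weight)
        result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
        # Remove the temporary columns so the caller's frame is left unchanged
        del df['_data_times_weight'], df['_weight_where_notnull']
        return result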
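As a rough usage sketch, the same sum-of-products idea can be applied to the data from the question. The variant below is not the function above but a hypothetical non-mutating version (`weighted_average_by` is a made-up name) that builds its temporary columns on a copy via `assign`. The output names `qty` and `avgpx` are taken from the R code in the question.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "contract": ["W"] + ["C"] * 8 + ["CC"] + ["SB"] * 3,
    "month": ["Z"] * 9 + ["U"] + ["V"] * 3,
    "year": [5] * 13,
    "buys": ["Sell"] * 9 + ["Buy"] * 4,
    "adjusted_lots": [-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
    "price": [554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25,
              426.00, 426.75, 3328.00, 11.65, 11.64, 11.60],
})


def weighted_average_by(frame, data_col, weight_col, by_cols):
    # Hypothetical variant: assign() returns a new frame, so the
    # temporary columns never appear on the caller's DataFrame.
    tmp = frame.assign(
        _wx=frame[data_col] * frame[weight_col],
        _w=frame[weight_col] * frame[data_col].notna(),
    )
    grouped = tmp.groupby(by_cols)
    # Per-group sum(value * weight) / sum(weight), using only built-in sums.
    return grouped["_wx"].sum() / grouped["_w"].sum()


keys = ["contract", "month", "year", "buys"]
summary = pd.DataFrame({
    "qty": df.groupby(keys)["adjusted_lots"].sum(),
    "avgpx": weighted_average_by(df, "price", "adjusted_lots", keys),
}).reset_index()
print(summary)
```

Both pieces rely only on the built-in grouped `sum`, which is where the speed advantage over a per-group Python function would be expected to come from.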
ErnestScribbler answered Sep 23 '22 18:09
