I have a dataframe:
Out[78]:
   contract month  year  buys  adjusted_lots    price
0         W     Z     5  Sell             -5   554.85
1         C     Z     5  Sell             -3   424.50
2         C     Z     5  Sell             -2   424.00
3         C     Z     5  Sell             -2   423.75
4         C     Z     5  Sell             -3   423.50
5         C     Z     5  Sell             -2   425.50
6         C     Z     5  Sell             -3   425.25
7         C     Z     5  Sell             -2   426.00
8         C     Z     5  Sell             -2   426.75
9        CC     U     5   Buy              5  3328.00
10       SB     V     5   Buy              5    11.65
11       SB     V     5   Buy              5    11.64
12       SB     V     5   Buy              2    11.60
I need the sum of adjusted_lots, and the weighted average of price (weighted by adjusted_lots), grouped by all the other columns, i.e. grouped by (contract, month, year and buys).
A similar solution in R was achieved with the following dplyr code, but I have been unable to do the same in pandas.
> newdf = df %>%
+   select(contract, month, year, buys, adjusted_lots, price) %>%
+   group_by(contract, month, year, buys) %>%
+   summarise(qty = sum(adjusted_lots),
+             avgpx = weighted.mean(x = price, w = adjusted_lots),
+             comdty = "Comdty")

> newdf
Source: local data frame [4 x 6]

  contract month year comdty qty     avgpx
1        C     Z    5 Comdty -19  424.8289
2       CC     U    5 Comdty   5 3328.0000
3       SB     V    5 Comdty  12   11.6375
4        W     Z    5 Comdty  -5  554.8500
Is the same possible with groupby, or is there another solution?
Calculate a weighted average in pandas using NumPy: the numpy library has a function, average(), which accepts an optional argument to specify the weights of the values. The function takes an array through the argument a= and an array of weights through the argument weights=.
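For example, a minimal sketch with illustrative numbers (not tied to the question's data):

import numpy as np

# np.average(a, weights=w) computes sum(a * w) / sum(w)
values = np.array([424.50, 424.00, 423.75])
weights = np.array([3, 2, 2])

print(np.average(values, weights=weights))  # 424.142857...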
Pandas groupby mean: to get the average (or mean) value in each group, you can apply the pandas mean() function directly to the columns selected from the result of a pandas groupby.
The groupby() function is used to split the data into groups based on some criteria; pandas objects can be split on any of their axes. Abstractly, grouping provides a mapping of labels to group names (the sort parameter controls whether the group keys are sorted).
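A small sketch of the plain (unweighted) groupby mean, assuming a dataframe shaped like the one in the question; this is not yet the weighted mean being asked for, but it shows the groupby mechanics:

import pandas as pd

# A few illustrative rows with the same columns as the question's dataframe.
df = pd.DataFrame({
    "contract": ["C", "C", "SB", "SB"],
    "month": ["Z", "Z", "V", "V"],
    "year": [5, 5, 5, 5],
    "buys": ["Sell", "Sell", "Buy", "Buy"],
    "adjusted_lots": [-3, -2, 5, 2],
    "price": [424.50, 424.00, 11.65, 11.60],
})

# Unweighted mean of price per group.
print(df.groupby(["contract", "month", "year", "buys"])["price"].mean())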
EDIT: updated the aggregation so that it works with recent versions of pandas.
To apply multiple functions to a groupby object, use named aggregation: keyword arguments whose values are tuples of the column to aggregate and the aggregation function to apply:
import numpy as np

# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

# Passing a nested dictionary of column -> {name: function} is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean': wm}}
# df.groupby(["contract", "month", "year", "buys"]).agg(f)

# Groupby and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(
    adjusted_lots=("adjusted_lots", "sum"),
    price_weighted_mean=("price", wm),
)

                          adjusted_lots  price_weighted_mean
contract month year buys
C        Z     5    Sell            -19           424.828947
CC       U     5    Buy               5          3328.000000
SB       V     5    Buy              12            11.637500
W        Z     5    Sell             -5            554.850000
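If you also want the group keys back as ordinary columns plus the constant comdty column, to match the flat R output, you can chain reset_index() and assign(). A sketch, assuming df and wm are defined as above (the names qty, avgpx and comdty simply mirror the R code):

newdf = (
    df.groupby(["contract", "month", "year", "buys"])
      .agg(qty=("adjusted_lots", "sum"),   # sum of lots per group
           avgpx=("price", wm))            # weighted mean of price per group
      .reset_index()                       # group keys become ordinary columns
      .assign(comdty="Comdty")             # constant column, as in the R version
)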
Hope this helps
[1]: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling
Computing a weighted average with groupby(...).apply(...) can be very slow (roughly 100x slower than the following approach). See my answer (and others) on this thread.
import pandas as pd

def weighted_average(df, data_col, weight_col, by_col):
    # Vectorised weighted mean per group: two grouped sums instead of a
    # Python-level apply, which is why it is so much faster.
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    # Zero out weights where the data is NaN, so missing values don't skew the result.
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    # Clean up the temporary helper columns.
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
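Usage on the question's dataframe would look like this (assuming df is the dataframe from the question):

# Weighted mean of price, weighted by adjusted_lots, per (contract, month, year, buys).
avgpx = weighted_average(df, "price", "adjusted_lots",
                         ["contract", "month", "year", "buys"])
print(avgpx)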