Objective: group pandas dataframe using a custom WMAPE (Weighted Mean Absolute Percent Error) function on multiple forecast columns and one actual data column, without for-loop. I know a for-loop & merges of output dataframes will do the trick. I want to do this efficiently.
Have: WMAPE function, successful use of WMAPE function on one forecast column of dataframe. One column of actual data, variable number of forecast columns.
Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv: https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1
Need: WMAPE function applied during groupby on multiple columns with a list of forecast columns fed into groupby line.
Output Desired: An output dataframe with categorical groups columns and all columns of WMAPE. Labeling is preferred but not needed (output image below).
Successful Code so far: Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):
def wmape(actual, forecast):
    # take two series and calculate a single WMAPE value from them
    # make a series called mape
    se_mape = abs(actual-forecast)/actual
    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = actual * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast
def wmape_gr(df_in, st_actual, st_forecast):
    # take a dataframe plus the names of the actual and forecast columns
    # and calculate a single WMAPE value from them
    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast
import pandas as pd

# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1', sep=",", header=0)
# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
Output Looks Like (first two rows):
Desired output would have all forecasts in one shot (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for-loop. I just want to do it within the groupby. I want to call a wmape function four times. I would appreciate any assistance.
This is a really good problem to show how to optimize a groupby.apply in pandas. There are two principles that I use to help with these problems.
Let's go line by line through your wmape_gr function.
se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
This line is completely independent of any group. You should do this calculation outside of the apply. Below I do this for each of the forecast columns:
df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']
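Since the number of forecast columns can vary, the four assignments above can also be written once for all of them. The following is only a sketch on made-up toy data; it assumes the forecast columns all start with `Forecast_`, and the `forecast_cols` and `ape` names are introduced here for illustration.

```python
import pandas as pd

# Toy data standing in for the real CSV (values are made up for illustration)
df = pd.DataFrame({
    "Actual": [10.0, 20.0, 40.0],
    "Forecast_1": [12.0, 18.0, 40.0],
    "Forecast_2": [5.0, 25.0, 30.0],
})

# Pick up however many forecast columns exist (assumed naming pattern)
forecast_cols = [c for c in df.columns if c.startswith("Forecast_")]

# |Forecast - Actual| / Actual for every forecast column at once;
# axis=0 aligns the Actual Series with the rows instead of the column labels
ape = df[forecast_cols].sub(df["Actual"], axis=0).abs().div(df["Actual"], axis=0)

# Name the new columns the same way as the explicit assignments
ape.columns = [f"actual_forecast_diff_{c.split('_')[-1]}" for c in forecast_cols]
df = df.join(ape)
```

No Python-level loop touches the rows; the list comprehensions only iterate over column names.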
Let's take a look at the next line:
ft_actual_sum = df_in[st_actual].sum()
This line is dependent on the group so we must use a groupby here, but it isn't necessary to place this within the apply function. It will be calculated later on below.
Let's move to the next line:
se_actual_prod_mape = df_in[st_actual] * se_mape
This again is independent of the group. Let's calculate it on the DataFrame as a whole.
df['forecast1_wampe'] = df['actual_forecast_diff_1'] * df['Actual']
df['forecast2_wampe'] = df['actual_forecast_diff_2'] * df['Actual']
df['forecast3_wampe'] = df['actual_forecast_diff_3'] * df['Actual']
df['forecast4_wampe'] = df['actual_forecast_diff_4'] * df['Actual']
Let's move on to the last two lines:
ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
These lines again are dependent on the group, but we still don't need to use apply. We now have each of the 4 'forecast_wampe' columns calculated independently of the group. We simply need to sum each one per group. The same goes for the 'Actual' column.
We can run two separate groupby operations to sum each of these columns like this:
g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wampe_cols = ['forecast1_wampe', 'forecast2_wampe', 'forecast3_wampe', 'forecast4_wampe']
forecast1_wampe_sum = g[forecast_wampe_cols].sum()
We get the following Series and DataFrame returned
Then we just need to divide each of the columns in the DataFrame by the Series. We'll need to use the div method to change the orientation of the division so that the indexes align:
forecast1_wampe_sum.div(actual_sum, axis='index')
And this returns our answer:
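Putting the steps together, the whole approach can be sketched end to end on toy data. This is an illustration under stated assumptions, not the exact code above: the values are made up, the forecast columns are assumed to start with `Forecast_`, and the divide-then-multiply by `Actual` is folded into a plain absolute error, which gives the same result only when `Actual` is never zero (with zeros, the original produces NaN for those rows and `sum` skips them).

```python
import pandas as pd

# Toy data (made-up values); the real frame also has an HOUR column
df = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Person": ["x", "x", "y", "y"],
    "DT": ["d1", "d1", "d1", "d1"],
    "Actual": [10.0, 30.0, 20.0, 20.0],
    "Forecast_1": [12.0, 27.0, 20.0, 10.0],
    "Forecast_2": [10.0, 20.0, 25.0, 25.0],
})

keys = ["City", "Person", "DT"]
forecast_cols = [c for c in df.columns if c.startswith("Forecast_")]

# Group-independent step: absolute errors for all forecast columns at once
abs_err = df[forecast_cols].sub(df["Actual"], axis=0).abs()

# Group-dependent step: built-in sums only, no apply
err_sum = abs_err.groupby([df[k] for k in keys]).sum()
actual_sum = df.groupby(keys)["Actual"].sum()

# Divide each error column by the per-group total of Actual, aligned on the index
wmape = err_sum.div(actual_sum, axis="index").add_suffix("_wmape").reset_index()
print(wmape)
```

The result has one row per (City, Person, DT) group and one `_wmape` column per forecast column, however many there are.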
If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:
import numpy as np

def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate wmape
    # for each forecast. Returns a dictionary with one entry per forecast
    # Convert to numpy arrays for broadcasting
    forecast = np.array(forecast.values)
    actual = np.array(actual.values).reshape((-1, 1))
    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual-forecast)/actual
    # Calculate sum of actual values
    ft_actual_sum = actual.sum(axis=0)
    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape
    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)
    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)}
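The reshape to `(-1, 1)` is what makes the broadcasting work: a column of actual values with shape `(n, 1)` is stretched across every forecast column of an `(n, k)` array. A small sketch with made-up numbers shows the shapes involved; it uses the absolute error directly, which matches the function above as long as no actual value is zero.

```python
import numpy as np

actual = np.array([10.0, 20.0, 40.0]).reshape(-1, 1)   # shape (3, 1)
forecast = np.array([[12.0, 5.0],
                     [18.0, 25.0],
                     [40.0, 30.0]])                     # shape (3, 2)

# The length-1 axis of `actual` is stretched across both forecast columns
abs_err = np.abs(actual - forecast)                    # shape (3, 2)

# Summing down the rows (axis=0) leaves one value per forecast column
wmape_per_forecast = abs_err.sum(axis=0) / actual.sum(axis=0)
print(wmape_per_forecast.shape)
```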
Then use apply on the proper columns:
# Group the dataframe and apply the function to appropriate columns
new_df = (
    df.groupby(['City', 'Person', 'DT'])
      .apply(lambda x: wmape(x['Actual'], x[[c for c in x if 'Forecast' in c]]))
      .to_frame()
      .reset_index()
)
This results in a dataframe with a single dictionary column.
The single column can be converted to multiple columns for the correct format:
# Convert the dictionary in a single column into 4 columns with proper names
# and concatenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]),
                    pd.DataFrame(list(new_df[0].values))], axis=1)
Result:
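A possible variation, offered as a sketch rather than as part of the answer above: if the applied function returns a pandas Series instead of a dictionary, apply expands it into one column per forecast, so the separate concat step is not needed. The toy data is made up, `wmape_series` is a hypothetical helper name, and the forecast columns are assumed to start with `Forecast_`.

```python
import pandas as pd

# Toy data (made-up values)
df = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Person": ["x", "x", "y", "y"],
    "DT": ["d1", "d1", "d1", "d1"],
    "Actual": [10.0, 30.0, 20.0, 20.0],
    "Forecast_1": [12.0, 27.0, 20.0, 10.0],
    "Forecast_2": [10.0, 20.0, 25.0, 25.0],
})

keys = ["City", "Person", "DT"]
forecast_cols = [c for c in df.columns if c.startswith("Forecast_")]

def wmape_series(group):
    # Hypothetical helper: one WMAPE per forecast column, returned as a Series
    abs_err = group[forecast_cols].sub(group["Actual"], axis=0).abs()
    return (abs_err.sum() / group["Actual"].sum()).add_suffix("_wmape")

# Selecting the value columns first keeps the grouping keys out of each group
df_grp = (
    df.groupby(keys)[["Actual"] + forecast_cols]
      .apply(wmape_series)
      .reset_index()
)
print(df_grp)
```

This still calls a Python function once per group, so it is likely to be slower on many groups than summing precomputed columns with built-in groupby methods.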