Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: custom WMAPE function aggregation function to multiple columns without for-loop?

Objective: group pandas dataframe using a custom WMAPE (Weighted Mean Absolute Percent Error) function on multiple forecast columns and one actual data column, without for-loop. I know a for-loop & merges of output dataframes will do the trick. I want to do this efficiently.

Have: WMAPE function, successful use of WMAPE function on one forecast column of dataframe. One column of actual data, variable number of forecast columns.

Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv: https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1

Need: WMAPE function applied during groupby on multiple columns with a list of forecast columns fed into groupby line.

Output Desired: An output dataframe with categorical groups columns and all columns of WMAPE. Labeling is preferred but not needed (output image below).

Successful Code so far: Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):

def wmape(actual, forecast):
    # we take two series and calculate an output a wmape from it

    # make a series called mape
    se_mape = abs(actual-forecast)/actual

    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = actual * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

def wmape_gr(df_in, st_actual, st_forecast):
    # we take two series and calculate an output a wmape from it

    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1',sep=",",header=0)

# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')

Output Looks Like (first two rows):

enter image description here

Desired output would have all forecasts in one shot (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for-loop. I just want to do it within the groupby. I want to call a wmape function four times. I would appreciate any assistance.

like image 898
Inder Jalli Avatar asked Dec 18 '22 18:12

Inder Jalli


2 Answers

This is a really good problem to show how to optimize a groupby.apply in pandas. There are two principles that I use to help with these problems.

  1. Any calculation that is independent of the group should not be done within a groupby
  2. If there is a built-in groupby method, use it first before using apply

Let's go line by line through your wmape_gr function.

se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

This line is completely independent of any group. You should do this calculation outside of the apply. Below I do this for each of the forecast columns:

df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']

Let's take a look at the next line:

ft_actual_sum = df_in[st_actual].sum()

This line is dependent on the group so we must use a groupby here, but it isn't necessary to place this within the apply function. It will be calculated later on below.

Let's move to the next line:

se_actual_prod_mape = df_in[st_actual] * se_mape

This again is independent of the group. Let's calculate it on the DataFrame as a whole.

df['forecast1_wampe'] = df['actual_forecast_diff_1'] *  df['Actual']
df['forecast2_wampe'] = df['actual_forecast_diff_2'] *  df['Actual']
df['forecast3_wampe'] = df['actual_forecast_diff_3'] *  df['Actual']
df['forecast4_wampe'] = df['actual_forecast_diff_4'] *  df['Actual']

Let's move on to the last two lines:

ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

These lines again are dependent on the group, but we still don't need to use apply. We now have each of the 4 'forecast_wampe' columns calcaulted independent of the group. We simply need to sum each one per group. The same goes for the 'Actual' column.

We can run two separate groupby operations to sum each of these columns like this:

g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wampe_cols = ['forecast1_wampe', 'forecast2_wampe', 'forecast3_wampe', 'forecast4_wampe']
forecast1_wampe_sum = g[forecast_wampe_cols].sum()

We get the following Series and DataFrame returned

enter image description here

enter image description here

Then we just need to divide each of the columns in the DataFrame by the Series. We'll need to use the div method to change the orientation of the division so that the indexes align

forecast1_wampe_sum.div(actual_sum, axis='index')

And this returns our answer:

enter image description here

like image 63
Ted Petrou Avatar answered May 03 '23 13:05

Ted Petrou


If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:

def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate wmape
    # for each forecast. Output shape is (1, num_forecasts)

    # Convert to numpy arrays for broadasting
    forecast = np.array(forecast.values)
    actual=np.array(actual.values).reshape((-1, 1))

    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual-forecast)/actual

    # Calculate sum of actual values
    ft_actual_sum = actual.sum(axis=0)

    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape

    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)

    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)}

Then use apply on the proper columns:

# Group the dataframe and apply the function to appropriate columns
new_df = df.groupby(['City', 'Person', 'DT']).apply(lambda x: wmape(x['Actual'], 
                                        x[[c for c in x if 'Forecast' in c]])).\
            to_frame().reset_index()

This results in a dataframe with a single dictionary column. Intermediate Results

The single column can be converted to multiple columns for the correct format:

# Convert the dictionary in a single column into 4 columns with proper names
# and concantenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]), 
                    pd.DataFrame(list(new_df[0].values))], axis=1)

Result:

Result of operations

like image 32
willk Avatar answered May 03 '23 14:05

willk