I’ve been struggling the past week trying to use apply to use functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large amount of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, rolling & groupby combined and I focused on performance as well. Since I’m not the only one struggling with these questions I thought I’d provide my solutions here with working examples, hoping it helps any existing/future pandas-users.
Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples. For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.
df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this:
group a b
0 0 2 2
1 0 4 1
2 0 0 4
3 1 0 2
4 1 3 2
5 1 3 0
Basic
def func_i1_o1(x):
return x+1
df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
return (x[0] + x[1])
df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
Roling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
Basic
def func_i2_o1(x):
return np.sum(x)
df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.
def func_i2_o1_rolling(x):
values_b = x
values_c = df.loc[x.index, 'c'].to_numpy()
return np.sum(values_b) + np.sum(values_c)
df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Basic
You could use a 'normal' solution by returning pd.Series:
def func_i1_o2(x):
return pd.Series((x+1, x+2))
df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination which is about 8 times faster!
def func_i1_o2_fast(x):
return x+1, x+2
df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.
1
def func_i1_o2_rolling_solution1(x):
output_1 = np.max(x)
output_2 = np.min(x)
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['m', 'n']] = output_1, output_2
return 0
df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.
2
rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()
def func_i1_o2_rolling_solution2(x):
output_list_1.append(np.max(x))
output_list_2.append(np.min(x))
return 0
df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).
Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling this means you'd have to manually set NaN's/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and solves the automatic NaNs/grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
output_1 = np.max(x)
output_2 = np.min(x)
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['q', 'r']] = output_1, output_2
return 0
df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
Basic
I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.
def func_i2_o2(x):
return np.mean(x), np.median(x)
df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
values_b = x.to_numpy()
values_c = df.loc[x.index, 'c'].to_numpy()
output_1 = np.min([np.sum(values_b), np.sum(values_c)])
output_2 = np.max([np.sum(values_b), np.sum(values_c)])
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['u', 'v']] = output_1, output_2
return 0
df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
values_b = x.to_numpy()
values_c = df.loc[x.index, 'c'].to_numpy()
output_1 = np.min([np.sum(values_b), np.sum(values_c)])
output_2 = np.max([np.sum(values_b), np.sum(values_c)])
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['w', 'x']] = output_1, output_2
return 0
df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With