Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas apply, rolling, groupby with multiple input & multiple output columns

Tags:

python

pandas

I’ve been struggling the past week trying to use apply to use functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large amount of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, rolling & groupby combined and I focused on performance as well. Since I’m not the only one struggling with these questions I thought I’d provide my solutions here with working examples, hoping it helps any existing/future pandas-users.

like image 966
Bob de Graaf Avatar asked May 05 '20 15:05

Bob de Graaf


1 Answers

Important notes

  1. The combination of apply & rolling in pandas has a very strong output requirement. You have to return one single value. You can not return a pd.Series, not a list, not an array, not secretly an array within an array, but just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don’t understand why it has this requirement for 'apply & rolling', because without rolling 'apply' doesn’t have this requirement. Must be due to some internal pandas functions.
  2. The combination of 'apply & rolling' combined with multiple input columns simply does not work! Imagine a dataframe with 2 columns, 6 rows and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can’t handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working but:
    • Axis = 0, will call your function per column. In the dataframe described above, it will call your function 10 times (not 12 because rolling=2) and since it’s per column, it only provides the 2 rolling values of that column…
    • Axis = 1, will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with values of 2 columns...
  3. When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It’s False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don’t need the indexes you can set raw to True to get a Numpy array, which often achieves a much better performance.
  4. When combining 'rolling & groupby', it returns a multi-indexes series which can’t easily serve as an input for a new column. The easiest solution is to append a reset_index(drop=True) as answered & commented here (Python - rolling functions for GroupBy object).
  5. You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
  6. Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!


Code examples

Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples. For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.

df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]

It will look like this:

group   a   b
0   0   2   2
1   0   4   1
2   0   0   4
3   1   0   2
4   1   3   2
5   1   3   0


Input 1 column, output 1 column

Basic

def func_i1_o1(x):    
    return x+1

df['c'] = df['b'].apply(func_i1_o1)


Rolling

def func_i1_o1_rolling(x):
    return (x[0] + x[1])

df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)


Roling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)




Input 2 columns, output 1 column

Basic

def func_i2_o1(x):
    return np.sum(x)

df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)


Rolling

As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.

def func_i2_o1_rolling(x):
    values_b = x
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)

df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)




Input 1 column, output 2 columns

Basic

You could use a 'normal' solution by returning pd.Series:

def func_i1_o2(x):
    return pd.Series((x+1, x+2))

df[['i', 'j']] = df['b'].apply(func_i1_o2)

Or you could use the zip/tuple combination which is about 8 times faster!

def func_i1_o2_fast(x):
    return x+1, x+2

df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))


Rolling

As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.

1

def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0

df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)

Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.

2

rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()

def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0

df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2

Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).
Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.


Rolling & Groupby

Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling this means you'd have to manually set NaN's/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and solves the automatic NaNs/grouping automatically. Once again, I use the reset_index solution at the end.

def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0

df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)




Input 2 columns, output 2 columns

Basic

I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.

def func_i2_o2(x):
    return np.mean(x), np.median(x)

df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))


Rolling

As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)

def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])    
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0

df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])    
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0

df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
like image 80
Bob de Graaf Avatar answered Nov 16 '22 15:11

Bob de Graaf