Given a large multi-column Pandas dataframe, I would like to compute the rolling "k-mean" over a window of N elements as quickly as possible.
Here "k-mean" is defined as the mean of the N-2k
elements of N
which exclude the k
largest and k
smallest elements.
Given the dataframe:
df = pandas.DataFrame(
    {'A': [34, 78, -2, -96, 58, -34, 44, -50, 42],
     'B': [-82, 28, 96, 46, 36, -34, -20, 10, -40]})
    A   B
0  34 -82
1  78  28
2  -2  96
3 -96  46
4  58  36
5 -34 -34
6  44 -20
7 -50  10
8  42 -40
With N=6 and k=1 the expected output is:
      A     B
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   NaN   NaN
4   NaN   NaN
5  14.0  19.0
6  16.5  22.5
7 -10.5  18.0
8   0.5  -2.0
My code that seems to do what is required is:
def k_mean(s: pandas.Series, trim: int) -> float:
    assert trim >= 0, f"Trim must not be negative, {trim} provided."
    if trim == 0:
        return s.mean()
    return s.sort_values()[trim:-trim].mean()

df.rolling(window=6, axis=0).apply(k_mean, kwargs={'trim': 1})
My question: is my code correct, and if so, is there a faster way to achieve the same result, especially considering large multi-column dataframes?
Perhaps there is a neat mathematical trick to help?
I'm not too concerned about the handling of the starting period if it helps performance: the result can either be NaN until N elements are available, or the window can grow to N once 2k+1 elements are in it.
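For reference, the growing-window variant seems expressible with pandas' min_periods argument (a sketch, assuming min_periods = 2k+1 so that at least one element survives the trimming):

df.rolling(window=6, min_periods=3).apply(k_mean, kwargs={'trim': 1})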
You can use Numba JIT to speed up the code significantly. The main idea is to convert each column to a Numpy array and then iterate over them using a sliding window.
import pandas
import numpy
import numba

# Note:
# You can declare the Numba function parameter types to reduce compilation time:
# @numba.njit('float64[::1](int64[::1], int64, int64)')
@numba.njit
def col_k_mean(arr: numpy.ndarray, window: int, trim: int):
    out = numpy.full(len(arr), numpy.nan)
    if trim == 0:
        # Plain rolling mean: maintain a running sum rather than re-summing each window.
        localSum = arr[0:window].sum()
        windowInv = 1.0 / window
        for i in range(window-1, len(arr)-1):
            out[i] = localSum * windowInv
            localSum += arr[i+1] - arr[i-window+1]
        if window-1 <= len(arr)-1:
            out[len(arr)-1] = localSum * windowInv
    else:
        # Trimmed mean: sort each window and average the middle window-2*trim elements.
        for i in range(window-1, len(arr)):
            out[i] = numpy.sort(arr[i-window+1:i+1])[trim:-trim].mean()
    return out

def apply_k_mean(df: pandas.DataFrame, window: int, trim: int) -> pandas.DataFrame:
    assert trim >= 0, f"Trim must not be negative, {trim} provided."
    return pandas.DataFrame({col: col_k_mean(df[col].to_numpy(), window, trim) for col in df})

apply_k_mean(df, window=6, trim=1)
Note that this method is efficient only when the window is not huge. With huge windows, it is better to use a more advanced sorting strategy, like one based on priority queues (using heaps) or, more generally, an incremental sort. Alternatively, if trim is very small and window is huge, then two partitions can be used rather than a full sort.
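For illustration, here is a minimal partition-based sketch (assuming NumPy >= 1.20 for sliding_window_view; the name col_k_mean_partition is mine, not part of the code above). numpy.partition with two pivot indices moves the trim smallest values to the front of each window and the trim largest to the back, so only the middle slice needs averaging:

import numpy
from numpy.lib.stride_tricks import sliding_window_view

def col_k_mean_partition(arr: numpy.ndarray, window: int, trim: int) -> numpy.ndarray:
    # Assumes len(arr) >= window and window > 2 * trim.
    out = numpy.full(len(arr), numpy.nan)
    wins = sliding_window_view(arr, window)  # one row per window position (a view, no copy)
    # After partitioning, indices [0, trim) of each row hold the trim smallest
    # values and indices [window-trim, window) hold the trim largest.
    part = numpy.partition(wins, (trim, window - trim - 1), axis=1)
    out[window - 1:] = part[:, trim:window - trim].mean(axis=1)
    return out

This does O(window) work per output element instead of O(window log window) for a full sort, at the cost of materialising an intermediate array of shape (len(arr)-window+1, window).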
On my machine, with a random dataframe of size (2, 10000) and with window=10 as well as trim=2, the above code is 300 times faster than the reference implementation (excluding the JIT compilation time)! With trim=0, it is 5800 times faster!
The computation can be even faster on huge dataframes using parallelism (supported in Numba using both parallel=True and prange).
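As a minimal sketch of that idea (the 2D function k_mean_2d and the float64 conversion are my assumptions, not part of the code above; for simplicity it always sorts, dropping the running-sum fast path for trim == 0):

import numba
import numpy
import pandas

@numba.njit(parallel=True)
def k_mean_2d(arr2d: numpy.ndarray, window: int, trim: int) -> numpy.ndarray:
    n, m = arr2d.shape
    out = numpy.full((n, m), numpy.nan)
    for j in numba.prange(m):  # columns are processed by separate threads
        for i in range(window - 1, n):
            # Copy the window so sorting does not modify the input.
            win = arr2d[i - window + 1:i + 1, j].copy()
            win.sort()
            # window - trim (rather than -trim) keeps trim == 0 valid.
            out[i, j] = win[trim:window - trim].mean()
    return out

values = k_mean_2d(df.to_numpy(dtype=numpy.float64), 6, 1)
result = pandas.DataFrame(values, columns=df.columns, index=df.index)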