I am running a groupby rolling count, sum, and mean using pandas v1.1.0, and I notice that the rolling count is considerably slower than the rolling mean and sum. This seems counterintuitive, as we could derive the count from the mean and sum and save time. Is this a bug, or am I missing something? Grateful for advice.
import pandas as pd
# Generate sample df
df = pd.DataFrame({'column1': range(600), 'group': 5*['l'+str(i) for i in range(120)]})
# sort by group for easy/efficient joining of new columns to df
df = df.sort_values('group', kind='mergesort').reset_index(drop=True)
# timing of groupby rolling count, sum and mean
%timeit df['mean']=df.groupby('group').rolling(3,min_periods=1)['column1'].mean().values
%timeit df['sum']=df.groupby('group').rolling(3,min_periods=1)['column1'].sum().values
%timeit df['count']=df.groupby('group').rolling(3,min_periods=1)['column1'].count().values
### Output
6.14 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.61 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
76.1 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
### df Output for illustration
print(df.head(10))
column1 group mean sum count
0 0 l0 0.0 0.0 1.0
1 120 l0 60.0 120.0 2.0
2 240 l0 120.0 360.0 3.0
3 360 l0 240.0 720.0 3.0
4 480 l0 360.0 1080.0 3.0
5 1 l1 1.0 1.0 1.0
6 121 l1 61.0 122.0 2.0
7 241 l1 121.0 363.0 3.0
8 361 l1 241.0 723.0 3.0
9 481 l1 361.0 1083.0 3.0
Did you really mean count (the number of non-NaN values)? That cannot be inferred from just sum and mean.
I suspect that what you are looking for is a size operator (just the length of the group, irrespective of whether or not there are any NaNs). While size exists in the regular groupby, it seems to be absent from RollingGroupBy (at least as of pandas 1.1.4). One can calculate the size of the rolling groups with:
# DRY:
rgb = df.groupby('group').rolling(3, min_periods=1)['column1']
# size is either:
rgb.apply(len)
# or
rgb.apply(lambda g: g.shape[0])
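For completeness, either result can be attached back to the frame the same way the question does; a sketch, relying on df already being sorted by group as above:
df['size'] = rgb.apply(len).values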
Neither of those two is as fast as it could be, of course, because there has to be a Python call for each window, rather than everything being vectorized and working directly off the rolling window indices start and end. On my system, either of the above is about 2x slower than rgb.sum() or rgb.mean().
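For comparison, here is a sketch of a fully vectorized way to get the same size column without any per-window Python calls, assuming the fixed window of 3 and min_periods=1 from the question:
window = 3
# Within each group, a trailing window with min_periods=1 holds
# min(position within group + 1, window) rows.
df['size'] = df.groupby('group').cumcount().add(1).clip(upper=window)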
Thinking about how one would implement size: it is obvious (just end - start for each window).
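As a plain-numpy illustration of that idea (not tied to pandas internals), assuming a fixed trailing window of 3 over 5 rows:
import numpy as np

n, window = 5, 3
# Window bounds for a trailing window: rows [start[i], end[i]) feed row i.
start = np.maximum(np.arange(n) - window + 1, 0)
end = np.arange(n) + 1
size = end - start   # array([1, 2, 3, 3, 3])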
Now, in case one really wanted to speed up count (the count of non-NaN values), one could first establish a "cumulative count":
cumcnt = (1 - df['column1'].isnull()).cumsum()
(this is very fast, about 200x faster than rgb.mean() on my system). Then the rolling function could simply take cumcnt[end] - cumcnt[start].
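A minimal sketch of that idea adapted to the grouped case, assuming the window of 3 and min_periods=1 from the question and that df is sorted by group as above:
# Cumulative count of non-NaN values within each group.
notnull = df['column1'].notnull().astype(int)
cumcnt_grp = notnull.groupby(df['group']).cumsum()
# Differencing over the window bounds gives the rolling non-NaN count.
df['fast_count'] = cumcnt_grp - cumcnt_grp.groupby(df['group']).shift(3).fillna(0)
On the sample frame above, this reproduces the count column from the question.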
I don't know enough about the internals of RollingGroupBy (and their use of various mixins) to assess feasibility, but at least functionally it seems pretty straightforward.
Update:
It seems that the issue has already been fixed by these commits. That was fast and simple -- I am impressed with the internal architecture of pandas and all the tools they already have on their Swiss army knife!