Pandas v1.1.0: Groupby rolling count slower than rolling mean & sum

Tags: python, pandas

I am running a groupby rolling count, sum, and mean using Pandas v1.1.0, and I notice that the rolling count is considerably slower than the rolling mean and sum. This seems counterintuitive, as we could derive the count from the mean and sum and save time. Is this a bug, or am I missing something? Grateful for advice.

import pandas as pd

# Generate sample df
df = pd.DataFrame({'column1': range(600), 'group': 5*['l'+str(i) for i in range(120)]})

# sort by group for easy/efficient joining of new columns to df
df = df.sort_values('group', kind='mergesort').reset_index(drop=True)

# timing of groupby rolling count, sum and mean
%timeit df['mean'] = df.groupby('group').rolling(3, min_periods=1)['column1'].mean().values
%timeit df['sum'] = df.groupby('group').rolling(3, min_periods=1)['column1'].sum().values
%timeit df['count'] = df.groupby('group').rolling(3, min_periods=1)['column1'].count().values

### Output
6.14 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.61 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
76.1 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

### df Output for illustration
print(df.head(10))

   column1 group   mean     sum  count
0        0    l0    0.0     0.0    1.0
1      120    l0   60.0   120.0    2.0
2      240    l0  120.0   360.0    3.0
3      360    l0  240.0   720.0    3.0
4      480    l0  360.0  1080.0    3.0
5        1    l1    1.0     1.0    1.0
6      121    l1   61.0   122.0    2.0
7      241    l1  121.0   363.0    3.0
8      361    l1  241.0   723.0    3.0
9      481    l1  361.0  1083.0    3.0
asked Aug 08 '20 by nrcjea001

1 Answer

Did you really mean count (the number of non-NaN values)? That cannot be inferred from just sum and mean.
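To see why not, consider windows whose values sum to zero: sum and mean alone cannot distinguish a zero from a NaN. A minimal illustration (mine, not part of the original answer):

import pandas as pd

s1 = pd.Series([0.0, float('nan')])  # sum=0.0, mean=0.0, count=1
s2 = pd.Series([0.0, 0.0])           # sum=0.0, mean=0.0, count=2

print(s1.sum(), s1.mean(), s1.count())  # 0.0 0.0 1
print(s2.sum(), s2.mean(), s2.count())  # 0.0 0.0 2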

I suspect that what you are actually looking for is a size operator (just the length of the group, irrespective of whether or not there are any NaNs). While size exists in a regular groupby, it seems to be absent from RollingGroupBy (at least as of pandas 1.1.4). One can calculate the size of the rolling groups with:

# DRY: build the rolling groupby object once
rgb = df.groupby('group').rolling(3, min_periods=1)['column1']

# size is either:
rgb.apply(len)

# or
rgb.apply(lambda g: g.shape[0])

Neither of those two is as fast as it could be, of course, because a Python function has to be called for each window, rather than the work being fully vectorized off the rolling windows' start and end indices. On my system, either of the above is about 2x slower than rgb.sum() or rgb.mean().

Thinking about how one would implement size, it is obvious: it is just end - start for each window.
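A small numpy sketch of that idea (the helper name is mine, purely illustrative of how the window bounds yield the size):

import numpy as np

def rolling_size(n, window):
    # trailing windows [start, end) per row, as for rolling(window)
    end = np.arange(1, n + 1)
    start = np.maximum(end - window, 0)
    return end - start  # the "size" of each window

print(rolling_size(5, 3))  # [1 2 3 3 3], matching the count column within each group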

Now, in case one really wanted to speed up count (the count of non-NaN values), one could first establish a "cumulative count":

cumcnt = (1 - df['column1'].isnull()).cumsum()  # running count of non-NaN values

(this is very fast, about 200x faster than rgb.mean() on my system).

Then the rolling function could simply take cumcnt[end] - cumcnt[start].

I don't know enough about the internals of RollingGroupBy (and their use of various mixins) to assess feasibility, but at least functionally it seems pretty straightforward.
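For what it's worth, here is a hedged sketch of that approach with plain numpy, applied per group (the helper is mine, not pandas internals, and assumes trailing windows with min_periods=1 as in the question):

import numpy as np

def rolling_count(col, window):
    # cumulative count of non-NaN values, padded with a leading 0 so that
    # cumcnt[end] - cumcnt[start] covers the half-open window [start, end)
    cumcnt = np.concatenate(([0], col.notna().cumsum().to_numpy()))
    end = np.arange(1, len(col) + 1)
    start = np.maximum(end - window, 0)
    return cumcnt[end] - cumcnt[start]

# should match the much slower rgb.count()
df['count2'] = df.groupby('group')['column1'].transform(lambda g: rolling_count(g, 3))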

Update:

It seems that the issue has already been fixed by these commits. That was fast and simple; I am impressed with the internal architecture of pandas and all the tools they already have on their Swiss Army knife!

answered Nov 15 '22 by Pierre D