Python Pandas: calculate rolling mean (moving average) over variable number of rows

Say I have the following dataframe

import pandas as pd
df = pd.DataFrame({ 'distance':[2.0, 3.0, 1.0, 4.0],
                    'velocity':[10.0, 20.0, 5.0, 40.0] })

gives the dataframe

   distance  velocity
0       2.0      10.0
1       3.0      20.0
2       1.0       5.0
3       4.0      40.0

How can I calculate the average of the velocity column over the rolling sum of the distance column? With the example above, create a rolling sum over the last N rows in order to get a minimum cumulative distance of 5, and then calculate the average velocity over those rows.

My target output would then be like this:

   distance  velocity    rv
0       2.0      10.0   NaN
1       3.0      20.0  15.0
2       1.0       5.0  11.7
3       4.0      40.0  22.5

where

15.0 = (10+20)/2        (2 because 3 + 2     >= 5)
11.7 = (10 + 20 + 5)/3  (3 because 1 + 3 + 2 >= 5) 
22.5 = (5 + 40)/2       (2 because 4 + 1     >= 5)

Update: in pandas-speak, my code should find the index at which the cumulative distance sum, taken backwards from the current record, first reaches 5 or more, and then use that index as the start of the moving-average window.
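For what it's worth, here is one loop-free way to express that reverse-cumulative-sum idea. This is my own sketch, not part of the original post; the helper name rolling_velocity_mean and the threshold parameter are made up. It precomputes prefix sums and uses np.searchsorted to locate each window start:

import numpy as np

def rolling_velocity_mean(df, threshold=5.0):
    # prefix sums with a leading zero, so p[k] == distance[0:k].sum()
    p = np.concatenate(([0.0], df['distance'].cumsum().to_numpy()))
    cv = np.concatenate(([0.0], df['velocity'].cumsum().to_numpy()))
    # for each row i, the largest window start k with p[i+1] - p[k] >= threshold
    starts = np.searchsorted(p, p[1:] - threshold, side='right') - 1
    idx = np.arange(len(df))
    out = np.full(len(df), np.nan)
    ok = starts >= 0                      # rows where the threshold is reachable at all
    out[ok] = (cv[idx[ok] + 1] - cv[starts[ok]]) / (idx[ok] + 1 - starts[ok])
    return out

df['rv'] = rolling_velocity_mean(df)

On the example above this reproduces the rv column shown in the target output.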

Asked Nov 24 '17 by philshem

People also ask

How does Python calculate rolling average in pandas?

In Python, we can calculate the moving average using the .rolling() method. This method provides rolling windows over the data, and we can apply the mean function over those windows to calculate moving averages. The size of the window is passed as a parameter to the method.
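For instance, a minimal sketch (the series values are made up for illustration):

import pandas as pd

s = pd.Series([10.0, 20.0, 5.0, 40.0])
# a window of 2 rows; the first value is NaN because its window is incomplete
print(s.rolling(window=2).mean())   # NaN, 15.0, 12.5, 22.5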

How do you calculate mean of multiple columns in pandas?

To calculate the mean of whole columns in the DataFrame, use pandas.Series.mean() on each column, or select a list of columns and call .mean() on the result. You can also get the mean of all numeric columns using DataFrame.mean(); pass the axis=0 argument to calculate the column-wise mean of the DataFrame.
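A small sketch of both variants (the frame and its column names a and b are made up):

import pandas as pd

df2 = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
print(df2[['a', 'b']].mean())   # mean of the selected columns: a 2.0, b 5.0
print(df2.mean(axis=0))         # column-wise mean of every numeric column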

What is Min_periods in rolling?

The min_periods argument specifies the minimum number of observations in the current window required to generate a rolling value; otherwise, the result is NaN.
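For example, applied to the velocity column of the question's DataFrame (the window size of 3 is arbitrary here):

# with min_periods=1, even the first, incomplete windows produce a value
df['velocity'].rolling(window=3, min_periods=1).mean()
# without it, the first two rows of a 3-row window are NaN
df['velocity'].rolling(window=3).mean()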

How do you calculate rolling average?

A rolling average continuously updates the average of a data set to include all the data in the set until that point. For example, the rolling average of return quantities at March 2012 would be calculated by adding the return quantities in January, February, and March, and then dividing that sum by three.
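In pandas terms, that example might look like this (the return quantities are made-up numbers):

import pandas as pd

returns = pd.Series([30, 45, 60], index=['Jan', 'Feb', 'Mar'])
# (30 + 45 + 60) / 3 = 45.0 is the rolling average at March
print(returns.rolling(window=3).mean())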


1 Answer

Not a particularly pandasy solution, but it sounds like you want to do something like

import numpy as np

df['rv'] = np.nan
for i in range(len(df)):
    j = i
    s = 0
    # walk backwards from row i until the cumulative distance reaches 5
    while j >= 0 and s < 5:
        s += df['distance'].loc[j]
        j -= 1
    if s >= 5:
        # average the velocities over the rows that make up that window
        df.loc[i, 'rv'] = df['velocity'].iloc[j+1:i+1].mean()

Update: Since this answer, the OP stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:

l = len(df)
a = np.full(l, np.nan)   # rows that never reach the threshold stay NaN
d = df['distance'].values
v = df['velocity'].values
for i in range(l):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += d[j]
        j -= 1
    if s >= 5:
        a[i] = v[j+1:i+1].mean()
df['rv'] = a

Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much on the first solution above, the second solution can be decorated with a @numba.jit out-of-the-box with immediate benefits. Benchmarking all three solutions on

pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})

I get the following results:

          Method                 Benchmark
-----------------------------------------------
Original data frame based     4.65 s ± 325 ms
Pure numpy array based       80.8 ms ± 9.95 ms
Jitted numpy array based      766 µs ± 52 µs

Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with

import numba

@numba.jit(nopython=True)
def numba_example(d, v):
    # take plain NumPy arrays as arguments so the loop compiles in nopython mode
    l = len(d)
    a = np.full(l, np.nan)      # rows that never reach the threshold stay NaN
    for i in range(l):
        j = i
        s = 0.0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            a[i] = 0.0          # accumulate the mean by hand instead of calling .mean()
            for k in range(j + 1, i + 1):
                a[i] += v[k]
            a[i] /= i - j
    return a

df['rv'] = numba_example(df['distance'].values, df['velocity'].values)

then the benchmark reduces to 158 µs ± 8.41 µs.

Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
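A rough sketch of that sliding-window idea (my own illustration, not part of the answer; it assumes the distances are non-negative, so the window start only ever moves forward):

import numpy as np

def rolling_mean_sliding(d, v, threshold=5.0):
    n = len(d)
    out = np.full(n, np.nan)
    j = -1       # current window is d[j+1 .. i]
    s = 0.0      # running sum of distances inside the window
    for i in range(n):
        s += d[i]
        # drop rows from the left while the window still meets the threshold
        while j + 1 < i and s - d[j + 1] >= threshold:
            j += 1
            s -= d[j]
        if s >= threshold:
            out[i] = v[j + 1:i + 1].mean()
    return out

df['rv'] = rolling_mean_sliding(df['distance'].values, df['velocity'].values)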

Answered Sep 28 '22 by fuglede