I've got a bunch of polling data; I want to compute a Pandas rolling mean to get an estimate for each day based on a three-day window. According to this question, the rolling_*
functions compute the window based on a specified number of values, and not a specific datetime range.
How do I implement this functionality?
Sample input data:
polls_subset.tail(20) Out[185]: favorable unfavorable other enddate 2012-10-25 0.48 0.49 0.03 2012-10-25 0.51 0.48 0.02 2012-10-27 0.51 0.47 0.02 2012-10-26 0.56 0.40 0.04 2012-10-28 0.48 0.49 0.04 2012-10-28 0.46 0.46 0.09 2012-10-28 0.48 0.49 0.03 2012-10-28 0.49 0.48 0.03 2012-10-30 0.53 0.45 0.02 2012-11-01 0.49 0.49 0.03 2012-11-01 0.47 0.47 0.05 2012-11-01 0.51 0.45 0.04 2012-11-03 0.49 0.45 0.06 2012-11-04 0.53 0.39 0.00 2012-11-04 0.47 0.44 0.08 2012-11-04 0.49 0.48 0.03 2012-11-04 0.52 0.46 0.01 2012-11-04 0.50 0.47 0.03 2012-11-05 0.51 0.46 0.02 2012-11-07 0.51 0.41 0.00
Output would have only one row for each date.
In Python, we can calculate the moving average using . rolling() method. This method provides rolling windows over the data, and we can use the mean function over these windows to calculate moving averages. The size of the window is passed as a parameter in the function .
A rolling mean is simply the mean of a certain number of previous periods in a time series. To calculate the rolling mean for one or more columns in a pandas DataFrame, we can use the following syntax: df['column_name']. rolling(rolling_window). mean()
Rolling sum using pandas rolling(). sum() If you apply the above function on a pandas dataframe, it will result in a rolling sum for all the numerical columns in the dataframe.
Definition and Usage. The bfill() method replaces the NULL values with the values from the next row (or next column, if the axis parameter is set to 'columns' ).
In the meantime, a time-window capability was added. See this link.
In [1]: df = DataFrame({'B': range(5)}) In [2]: df.index = [Timestamp('20130101 09:00:00'), ...: Timestamp('20130101 09:00:02'), ...: Timestamp('20130101 09:00:03'), ...: Timestamp('20130101 09:00:05'), ...: Timestamp('20130101 09:00:06')] In [3]: df Out[3]: B 2013-01-01 09:00:00 0 2013-01-01 09:00:02 1 2013-01-01 09:00:03 2 2013-01-01 09:00:05 3 2013-01-01 09:00:06 4 In [4]: df.rolling(2, min_periods=1).sum() Out[4]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 3.0 2013-01-01 09:00:05 5.0 2013-01-01 09:00:06 7.0 In [5]: df.rolling('2s', min_periods=1).sum() Out[5]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 3.0 2013-01-01 09:00:05 3.0 2013-01-01 09:00:06 7.0
What about something like this:
First resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method
option to fill in missing date values. Next, pass the resampled frame into pd.rolling_mean
with a window of 3 and min_periods=1 :
pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1) favorable unfavorable other enddate 2012-10-25 0.495000 0.485000 0.025000 2012-10-26 0.527500 0.442500 0.032500 2012-10-27 0.521667 0.451667 0.028333 2012-10-28 0.515833 0.450000 0.035833 2012-10-29 0.488333 0.476667 0.038333 2012-10-30 0.495000 0.470000 0.038333 2012-10-31 0.512500 0.460000 0.029167 2012-11-01 0.516667 0.456667 0.026667 2012-11-02 0.503333 0.463333 0.033333 2012-11-03 0.490000 0.463333 0.046667 2012-11-04 0.494000 0.456000 0.043333 2012-11-05 0.500667 0.452667 0.036667 2012-11-06 0.507333 0.456000 0.023333 2012-11-07 0.510000 0.443333 0.013333
UPDATE: As Ben points out in the comments, with pandas 0.18.0 the syntax has changed. With the new syntax this would be:
df.resample("1d").sum().fillna(0).rolling(window=3, min_periods=1).mean()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With