Assume I have daily data (not regularly spaced), I want to compute for each month the moving standard deviation (or an arbitrarily non linear function) in the past 5 months. For example, for May 2012 I would compute the stddev from the period starting from Jan 2012 to May 2012 (5 months). For June 2012 the period starts in Feb 2012, etc. The final result is a time series with monthly values.
I cannot apply a rolling window because this would first be daily and secondly I need to specify the number of values (a rolling window does not aggregate by time frame, some posts addressed this issue but they are not relevant to my problem as the rolling would still be for each new day).
I cannot apply resampling, because then the sample would be every 5 months, e..g I would only have values for May 2012, Oct 2012, March 2013... Finally, as the function is not linear I cannot reconstruct it by first doing a monthly sample and then applying a 5 period rolling window on it.
So I would need a sort of resampling functionality applied to a rolling window defined by time interval (not number of values).
How can I do this in pandas? One approach could be to combine several (5 in this example) resampled (5 months) time series, each with one month of offset and then align all these series into one... but I do not know how to implement this.
Resample Pandas time-series data. The resample() function is used to resample time-series data. Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
rolling() function is a very useful function. It Provides rolling window calculations over the underlying data in the given Series object. Syntax: Series.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
Resample Hourly Data to Daily Dataresample() method. To aggregate or temporal resample the data for a time period, you can take all of the values for each day and summarize them. In this case, you want total daily rainfall, so you will use the resample() method together with . sum() .
I had a similar issue dealing with a timedelta series where I wanted to take a moving average and then resample. Here is an example where I have 100 seconds of data. I take a rolling average of 10 second windows and then resample for every 5 seconds, taking the first entry in each resample bin. The result should be the previous 10 second average at 5 second increments. You could do something similar with month format instead of seconds:
df = pd.DataFrame(range(0,100), index=pd.TimedeltaIndex(range(0,100),'s'))
df.rolling('10s').mean().resample('5s').first()
Result:
0
00:00:00 0.0
00:00:05 2.5
00:00:10 5.5
00:00:15 10.5
00:00:20 15.5
00:00:25 20.5
00:00:30 25.5
00:00:35 30.5
00:00:40 35.5
00:00:45 40.5
00:00:50 45.5
00:00:55 50.5
00:01:00 55.5
00:01:05 60.5
00:01:10 65.5
00:01:15 70.5
00:01:20 75.5
00:01:25 80.5
00:01:30 85.5
00:01:35 90.5
Here's an attempt - not super clean, but it might work.
Dummy data:
df = pd.DataFrame(data={'a': 1.},
index=pd.date_range(start='2001-1-1', periods=1000))
First define a function to decrease a date n
number of months. This needs to be cleaned up, but works for n<=12.
from datetime import datetime
def decrease_month(date, n):
assert(n <= 12)
new_month = date.month - n
year_offset = 0
if new_month <= 0:
year_offset = -1
new_month = 12 + new_month
return datetime(date.year + year_offset, new_month, 1)
Then, add 5 new columns for the 5 rolling periods that each date will cross.
for n in range(rolling_period):
df['m_' + str(n)] = df.index.map(lambda x: decrease_month(x, n))
Then - use the melt
function to convert the data from wide to long, so each rolling period will have one entry.
df_m = pd.melt(df, id_vars='a')
You should be able to groupby the newly created column, and each date will represent the right 5 month rolling period.
In [222]: df_m.groupby('value').sum()
Out[222]:
a
value
2000-09-01 31
2000-10-01 59
2000-11-01 90
2000-12-01 120
2001-01-01 151
2001-02-01 150
2001-03-01 153
2001-04-01 153
2001-05-01 153
2001-06-01 153
2001-07-01 153
...
I have solved a similar problem with the following code:
interval = 5
frames = []
for base in range(interval):
frame = data.resample(f"{interval}min", base=base).last()
frames.append(frame)
pd.concat(frames, axis=0).sort_index()
Here I create 5 data frames which are resampled at the same interval, but have different offsets (the base parameter). Then I just have to concatenate and sort them. Should usually be much more efficient than rolling + resampling (the only overhead is the sorting).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With