I have a large time-series data set at 30-minute intervals, and I am trying to apply a sliding window to it, separately for each time of day, using pandas.
I'm no statistician and not great at thinking through or coding this sort of work, but here is my clumsy attempt at doing what I want. I'm really looking for help improving it, as I know there must be a better way of doing this, possibly using MultiIndexes and some proper iteration. I have struggled to do this across the 'time axes', though.
def sliding_window(run, data, type='mean'):
    data = data.asfreq('30T')
    for x in date_range(run.START, run.END, freq='1d'):
        # %w gives the weekday as a number: 0 is Sunday, 6 is Saturday
        if int(datetime.strftime(x, "%w")) in (0, 6):
            selector = weekends
        else:
            selector = weekdays
        # Keep only matching days within run.WINDOW days either side of x
        window = data.select(selector).truncate(x - relativedelta(days=run.WINDOW), x + relativedelta(days=run.WINDOW))
        points = window.groupby(lambda date: minutes(date, x)).mean()
        for point in points.index:
            data[datetime(x.year, x.month, x.day, point.hour, point.minute)] = points[point]
    return data
run.START and run.END are two points within data, and run.WINDOW is 45 (days). I've been staring at this code for a long time, so I'm not sure how much of it makes sense to anyone else; please ask and I will clarify anything that is unclear.
SOLVED: (Solution courtesy of crewbum)
The modified function, which as expected is stupidly fast:
def sliding_window(run, data, am='mean', days='weekdays'):
    data = data.asfreq('30T')
    data = DataFrame({'Day': [d.date() for d in data.index], 'Time': [d.time() for d in data.index],
                      'Weekend': [weekday_string(d) for d in data.index], 'data': data})
    pivot = data.pivot_table(values='data', rows='Day', cols=['Weekend', 'Time'])
    pivot = pivot[days]
    # Rolling median or mean over run.WINDOW*2 days, per time-of-day column
    if am == 'median':
        mean = rolling_median(pivot, run.WINDOW*2, min_periods=1)
    else:
        mean = rolling_mean(pivot, run.WINDOW*2, min_periods=1)
    return DataFrame({'mean': unpivot(mean),
                      'amax': np.tile(pivot.max().values, pivot.shape[0]),
                      'amin': np.tile(pivot.min().values, pivot.shape[0])}, index=data.index)
The unpivot function:
def unpivot(frame):
    N, K = frame.shape
    # Pair every row label (day) with every column label (time of day)
    index = [datetime.combine(d[0], d[1])
             for d in zip(np.asarray(frame.index).repeat(K), np.tile(np.asarray(frame.ix[0].index), N))]
    return Series(frame.values.ravel('C'), index=index)
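For anyone on a newer pandas release, where the .ix indexer is no longer available, here is a minimal sketch of the same unpivot idea. It assumes the row labels are datetime.date objects and the column labels are datetime.time objects; the two-day sample frame is made up purely for illustration.

```python
from datetime import date, datetime, time

import numpy as np
import pandas as pd


def unpivot(frame):
    """Flatten a day-by-time frame into one Series in row-major order."""
    n_rows, n_cols = frame.shape
    # Repeat each day once per column and cycle the times once per row,
    # so the labels line up with the row-major ravel of the values.
    days = np.repeat(frame.index.to_numpy(), n_cols)
    times = np.tile(frame.columns.to_numpy(), n_rows)
    index = [datetime.combine(d, t) for d, t in zip(days, times)]
    return pd.Series(frame.to_numpy().ravel("C"), index=index)


# Hypothetical sample: two days, two half-hour slots.
frame = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=[date(2012, 1, 1), date(2012, 1, 2)],
    columns=[time(0, 0), time(0, 30)],
)
flat = unpivot(frame)
```

The column labels are read from frame.columns, which carries the same labels that frame.ix[0].index did.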
center=True on rolling_mean appears to be broken at the moment; I will file an issue on GitHub if I get the chance.
If you're interested in MultiIndexes, check out df.pivot_table(). It will create a MultiIndex automatically when multiple keys are passed in the rows and/or cols parameters.
For example, say you want to pivot the data so there are separate columns for each weekend and non-weekend 30-minute block of the day; you could do that by adding Day, Weekend, and TOD (time-of-day) columns to the DataFrame, and then passing those column names to pivot_table as follows.
pivot = df.pivot_table(values='Usage', rows='Day', cols=['TOD', 'Weekend'])
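Newer pandas releases call these parameters index and columns rather than rows and cols. A rough, self-contained sketch of the pivot under that naming follows; the two weeks of sample data and the way the Day, TOD, and Weekend columns are built are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two weeks of half-hourly readings, starting on a Monday.
idx = pd.date_range("2012-01-02", periods=14 * 48, freq="30min")
df = pd.DataFrame({"Usage": np.arange(len(idx), dtype=float)}, index=idx)

# Helper columns for the pivot.
df["Day"] = df.index.normalize()         # calendar day (midnight timestamp)
df["TOD"] = df.index.time                # time of day as datetime.time
df["Weekend"] = df.index.dayofweek >= 5  # Saturday is 5, Sunday is 6

# One row per day, one column per (time of day, weekend flag) pair.
pivot = df.pivot_table(values="Usage", index="Day", columns=["TOD", "Weekend"])

# Keep only the weekday columns: one column per half-hour slot.
weekday_cols = pivot.xs(False, axis=1, level="Weekend")
```

Weekday rows hold NaN in the Weekend=True columns and vice versa, which is why a min_periods setting matters when a rolling function is applied afterwards.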
In this format, pd.rolling_mean() (or a function of your creation) can easily be applied to the columns of pivot. pd.rolling_mean(), like all rolling/moving functions in pandas, even accepts a center parameter for centered sliding windows.
pd.rolling_mean(pivot, 90, center=True, min_periods=1)
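In newer pandas releases the top-level pd.rolling_mean() has been replaced by the .rolling() method on DataFrame and Series. The sketch below shows what a centered rolling mean might look like with that method; the small pivot frame, its column labels, and the window of 5 rows are made up so the numbers are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted data: 10 days (rows) by 3 time-of-day slots (columns).
days = pd.date_range("2012-01-01", periods=10, freq="D")
pivot = pd.DataFrame(
    np.arange(30, dtype=float).reshape(10, 3),
    index=days,
    columns=["00:00", "00:30", "01:00"],
)

# Centered window of 5 rows, applied to each column independently.
# min_periods=1 lets the shorter windows at both edges still give a value.
smoothed = pivot.rolling(window=5, center=True, min_periods=1).mean()
```

Each column is smoothed across days only, so every time-of-day slot gets its own sliding window. For the real data, window=90 would correspond to the call above.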