
Sliding window average across time axes

Tags: python, pandas

I have a large time-series dataset at 30-minute intervals and am trying to compute a sliding-window average over it, but separately for each time of day, using pandas.

I'm no statistician and not great at thinking through or coding this sort of work, but here is my clumsy attempt at what I want. I'm really looking for help improving it, as I know there will be a better way of doing this, possibly using MultiIndexes and some proper iteration? But I have struggled to do this across the 'time axes'.

def sliding_window(run,data,type='mean'):
    data = data.asfreq('30T')
    for x in date_range(run.START, run.END, freq='1d'):
        if int(datetime.strftime(x, "%w")) == 0 or int(datetime.strftime(x, "%w")) == 6:
            points = data.select(weekends).truncate(x - relativedelta(days=run.WINDOW),x + relativedelta(days=run.WINDOW)).groupby(lambda date: minutes(date, x)).mean()
        else:
            points = data.select(weekdays).truncate(x - relativedelta(days=run.WINDOW),x + relativedelta(days=run.WINDOW)).groupby(lambda date: minutes(date, x)).mean()
        for point in points.index:
            data[datetime(x.year,x.month,x.day,point.hour,point.minute)] = points[point]
    return data

run.START and run.END are two timestamps within data, and run.WINDOW is 45 (days). I've been staring at this code for so long that I'm not sure how much of it makes sense to anyone else, so please ask and I'll clarify anything that is unclear.

SOLVED: (Solution courtesy of crewbum)

The modified function, which as expected is stupidly fast:

def sliding_window(run,data,am='mean',days='weekdays'):
    data = data.asfreq('30T')
    data = DataFrame({'Day': [d.date() for d in data.index], 'Time': [d.time() for d in data.index], 'Weekend': [weekday_string(d) for d in data.index], 'data': data})
    pivot = data.pivot_table(values='data', rows='Day', cols=['Weekend', 'Time'])
    pivot = pivot[days]
    if am == 'median':
        mean = rolling_median(pivot, run.WINDOW*2, min_periods=1)
    else:
        mean = rolling_mean(pivot, run.WINDOW*2, min_periods=1)
    return DataFrame({'mean': unpivot(mean), 'amax': np.tile(pivot.max().values, pivot.shape[0]), 'amin': np.tile(pivot.min().values, pivot.shape[0])}, index=data.index)

The unpivot function:

def unpivot(frame):
    N, K = frame.shape
    return Series(frame.values.ravel('C'), index=[datetime.combine(d[0], d[1]) for d in zip(np.asarray(frame.index).repeat(K), np.tile(np.asarray(frame.ix[0].index), N))])
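
For readers on newer pandas, where .ix is no longer available, here is a minimal sketch of the same unpivot idea. It assumes the frame's index holds date objects and its columns hold time objects, as in the pivot above; the small sample frame is hypothetical and only there to make the snippet runnable.

```python
from datetime import date, datetime, time

import pandas as pd


def unpivot(frame):
    # Row-major flattening: for each day, walk through every time of day.
    stamps = [datetime.combine(d, t) for d in frame.index for t in frame.columns]
    return pd.Series(frame.to_numpy().ravel(), index=pd.DatetimeIndex(stamps))


# Hypothetical 2-day x 2-slot frame (not data from the question).
frame = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=[date(2012, 1, 1), date(2012, 1, 2)],
    columns=[time(0, 0), time(0, 30)],
)
series = unpivot(frame)
```

The nested comprehension produces timestamps in the same order that ravel() flattens the values, so each value stays paired with its own day and time.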

The center=True option on rolling_mean appears to be broken at the moment; I'll file an issue on GitHub if I get the chance.

Ben Hussey asked Oct 06 '22 15:10



1 Answer

If you're interested in MultiIndexes, check out df.pivot_table(). It will create a MultiIndex automatically when multiple keys are passed in the rows and/or cols parameters.

For example, say you want to pivot the data so there are separate columns for each weekend and non-weekend 30-minute block of the day; you could do that by adding Day, Weekend, and TOD (time-of-day) columns to the DataFrame, and then passing those column names to pivot_table as follows.

pivot = df.pivot_table(values='Usage', rows='Day', cols=['TOD', 'Weekend'])
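
As a rough sketch of how those helper columns might be built, assuming a DataFrame with a DatetimeIndex and a Usage column: the sample data below is made up, and the snippet uses the index=/columns= keywords that newer pandas versions use in place of rows=/cols=.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one week (Mon-Sun) of half-hourly readings.
idx = pd.date_range("2012-01-02", periods=7 * 48, freq="30min")
df = pd.DataFrame({"Usage": np.arange(len(idx), dtype=float)}, index=idx)

# Helper columns that the pivot is keyed on.
df["Day"] = df.index.normalize()       # calendar day (midnight timestamp)
df["TOD"] = df.index.time              # time of day
df["Weekend"] = df.index.dayofweek >= 5  # Saturday=5, Sunday=6

# One row per day, one column per (time of day, weekend flag) pair.
pivot = df.pivot_table(values="Usage", index="Day", columns=["TOD", "Weekend"])
```

Weekday rows are NaN in the weekend columns and vice versa, which is why min_periods matters when a rolling function is applied afterwards.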

In this format, pd.rolling_mean() (or a function of your creation) can easily be applied to the columns of pivot. pd.rolling_mean(), like all rolling/moving functions in pandas, even accepts a center parameter for centered sliding windows.

pd.rolling_mean(pivot, 90, center=True, min_periods=1)
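
In newer pandas versions the module-level pd.rolling_mean() is gone and the same operation is written with the .rolling() method. A small sketch on a made-up stand-in for pivot, with a window of 3 instead of 90 so the result is easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pivot: 10 days (rows) x 2 time-of-day columns.
days = pd.date_range("2012-01-01", periods=10, freq="D")
pivot = pd.DataFrame(
    {"00:00": np.arange(10.0), "00:30": np.arange(10.0) * 2},
    index=days,
)

# Centered window over days, applied to each column independently.
# min_periods=1 lets the shortened windows at the edges still produce a value.
smoothed = pivot.rolling(window=3, center=True, min_periods=1).mean()
```

Because each column is one time of day, rolling down the rows averages the same half-hour slot across neighbouring days, which is the per-time-of-day window the question asks for.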
Garrett answered Oct 10 '22 03:10
