Rolling statistics in NumPy or PyTorch

Question

I have tensors of sensor data, each of shape (4, 1500): 1500 time points with 4 features per time point. I want to "smooth" the sequences with a rolling average or other rolling statistics. The end goal is to try to improve an LSTM autoencoder by feeding it rolling statistics instead of the long raw sequence. I am familiar with rolling windows in pandas, and currently I am doing this:

#tensor shape:
 data.shape
 (4,1500)

 #convert data to numpy array and then to dataframe and perform rolling mean
 rolled_data=pd.DataFrame(data.numpy().swapaxes(1,0)).rolling(10).mean()[::10]
 rolled_data.shape
 (150, 4)

 # convert back the dataframe to tensor
 tensor_rolled_data=torch.Tensor(rolled_data.to_numpy().swapaxes(1,0))
 tensor_rolled_data.shape
 torch.Size([4, 150])

My question is: is there a better way to do this? Is there a function in NumPy or PyTorch that computes rolling statistics in a cleaner or more efficient way?

jodag · Accepted Answer

Since you're striding the output by the size of the window, this is actually more akin to downsampling by averaging than to computing rolling statistics. We can take advantage of the fact that the windows don't overlap by simply reshaping the initial tensor.

Using `Tensor.reshape`

Assuming the length of the time dimension is divisible by the window size (10 here), you can just reshape the tensor to shape (4, 150, 10) and compute the statistic along the last dimension. For example:

win_size = 10
tensor_rolled_data = data.reshape(data.shape[0], -1, win_size).mean(dim=2)
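If the sequence length is not an exact multiple of the window size, the reshape above fails. A minimal sketch of one way to handle that, assuming it is acceptable to drop the trailing samples that do not fill a whole window (the helper name `block_mean` is hypothetical, not part of PyTorch):

```python
import torch


def block_mean(data: torch.Tensor, win_size: int) -> torch.Tensor:
    """Average non-overlapping windows along the last dimension.

    Assumption of this sketch: trailing samples that do not fill a
    complete window are simply discarded.
    """
    n_windows = data.shape[-1] // win_size
    # Keep only the samples that fit into whole windows.
    trimmed = data[..., : n_windows * win_size]
    # Split the last dimension into (n_windows, win_size) and average each window.
    return trimmed.reshape(*data.shape[:-1], n_windows, win_size).mean(dim=-1)


data = torch.arange(4 * 1500, dtype=torch.float32).reshape(4, 1500)
out = block_mean(data, 10)
print(out.shape)

# A length that is not divisible by 10 also works; the last 5 samples are dropped.
out_ragged = block_mean(torch.ones(4, 1505), 10)
print(out_ragged.shape)
```

Whether dropping the remainder is appropriate depends on the data; padding the end of the sequence is an alternative.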

The reshape solution doesn't give exactly the same results as your tensor_rolled_data. Here the first entry contains the mean of the first 10 samples, the second entry the mean of the next 10 samples, and so on. The pandas solution is a "causal filter": the first entry contains the mean of the 10 most recent samples up to and including sample 0, the second the mean of the 10 most recent samples up to and including sample 10, and so on. (Note that the first entry is NaN in the pandas solution, since fewer than 10 samples exist up to that point.)

If this difference is unacceptable, you can recreate the pandas result by padding the front with 9 NaN values and clipping off the last 9 samples.

import torch.nn.functional as F
win_size = 10
# pad with `nan` to match behavior of pandas
data_padded = F.pad(data[None, :, :-(win_size - 1)], (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
# find mean of groups of N samples
tensor_rolled_data = data_padded.reshape(data.shape[0], -1, win_size).mean(dim=2)
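The claim that the padded reshape reproduces the pandas output can be checked directly. The following is a small sketch, assuming random stand-in data of the same shape as in the question and a tolerance chosen for float32 arithmetic:

```python
import pandas as pd
import torch
import torch.nn.functional as F

torch.manual_seed(0)
data = torch.randn(4, 1500)  # stand-in for the sensor tensor
win_size = 10

# Reference: the pandas pipeline from the question.
reference = torch.tensor(
    pd.DataFrame(data.numpy().T).rolling(win_size).mean()[::win_size].to_numpy().T,
    dtype=torch.float32,
)

# Drop the last 9 samples, then pad the front with 9 NaN values.
data_padded = F.pad(data[:, : -(win_size - 1)], (win_size - 1, 0), value=float("nan"))
rolled = data_padded.reshape(data.shape[0], -1, win_size).mean(dim=2)

# equal_nan=True treats the leading NaN column as matching.
matches = torch.allclose(rolled, reference, atol=1e-5, equal_nan=True)
print(rolled.shape, matches)
```

Constant-mode `F.pad` accepts the 2D tensor directly, so the extra batch dimension used above is not needed in this particular case.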

Using `Tensor.unfold`

To address the comment about what to do when the windows overlap: if you're only interested in the mean, there are a number of ways to compute it (e.g. convolution, average pooling, tensor unfolding). That said, Tensor.unfold gives the most general solution, since it can be used to compute any statistic over a window. For example:

# same as first example above
win_size = 10
tensor_rolled_data = data.unfold(dimension=1, size=win_size, step=win_size).mean(dim=2)

or

# same as second example above
import torch.nn.functional as F
win_size = 10
data_padded = F.pad(data.unsqueeze(0), (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
tensor_rolled_data = data_padded.unfold(dimension=1, size=win_size, step=win_size).mean(dim=2)

In the above cases, unfolding produces the same result as reshape since size and step are equal. However, unlike reshape, unfolding also supports size != step.

win_size = 10
stride = 2
tensor_rolled_data = data.unfold(1, win_size, stride).mean(dim=2)
# produces shape [4, 746]
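For the mean specifically, the average pooling and convolution routes mentioned earlier give the same numbers as the overlapping unfold. A brief sketch for comparison, assuming random stand-in data and treating each feature as its own single-channel sequence for the convolution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
data = torch.randn(4, 1500)  # stand-in for the sensor tensor
win_size, stride = 10, 2

# Reference: overlapping windows via unfold.
via_unfold = data.unfold(1, win_size, stride).mean(dim=2)

# Average pooling expects (batch, channels, length).
via_pool = F.avg_pool1d(data.unsqueeze(0), kernel_size=win_size, stride=stride).squeeze(0)

# Convolution with a box kernel; each feature becomes a batch item with one channel.
kernel = torch.full((1, 1, win_size), 1.0 / win_size)
via_conv = F.conv1d(data.unsqueeze(1), kernel, stride=stride).squeeze(1)

print(via_unfold.shape, via_pool.shape, via_conv.shape)
print(torch.allclose(via_unfold, via_pool, atol=1e-5))
print(torch.allclose(via_unfold, via_conv, atol=1e-5))
```

Pooling and convolution only cover the mean (or other weighted sums), whereas the unfolded windows can be reduced with any function.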

Alternatively, you can pad the front of each feature with win_size - 1 values to match the pandas output (here, `rolling(win_size).mean()[::stride]`).

import torch.nn.functional as F
win_size = 10
stride = 2
data_padded = F.pad(data.unsqueeze(0), (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
tensor_rolled_data = data_padded.unfold(1, win_size, stride).mean(dim=2)
# produces shape [4, 750]
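Because unfold exposes each window as the last dimension, statistics other than the mean follow the same pattern. A short sketch, assuming random stand-in data; the particular statistics chosen here are only illustrative:

```python
import torch

torch.manual_seed(0)
data = torch.randn(4, 1500)  # stand-in for the sensor tensor
win_size, stride = 10, 2

# Each window becomes a slice along the new last dimension: shape (4, 746, 10).
windows = data.unfold(1, win_size, stride)

rolling_std = windows.std(dim=2)
# max and median along a dimension return (values, indices); keep the values.
rolling_max = windows.max(dim=2).values
rolling_median = windows.median(dim=2).values

print(rolling_std.shape, rolling_max.shape, rolling_median.shape)
```

Note that for an even window size `torch.median` returns the lower of the two middle values rather than their average, which differs from the pandas and NumPy convention.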

Note: In practice you probably don't want to pad with NaN, since NaN values propagate through later computations and quickly become a headache. Instead you could use zero padding, 'replicate' padding, or 'reflect' (mirror) padding.
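The padding alternatives can be compared side by side. This is a rough sketch, assuming a simple ramp as stand-in data so the effect of each mode on the first window is easy to see; the batch dimension is added because the 'replicate' and 'reflect' modes of `F.pad` expect a batched layout for 1D padding:

```python
import torch
import torch.nn.functional as F

# Ramp data: row i holds the values 1500*i, 1500*i + 1, ...
data = torch.arange(4 * 1500, dtype=torch.float32).reshape(4, 1500)
win_size, stride = 10, 2
pad = (win_size - 1, 0)  # pad only the front of the time dimension

batched = data.unsqueeze(0)  # shape (1, 4, 1500)
padded = {
    "zeros": F.pad(batched, pad, mode="constant", value=0.0),
    "replicate": F.pad(batched, pad, mode="replicate"),
    "reflect": F.pad(batched, pad, mode="reflect"),
}

# Rolling mean over each padded variant; all have shape (4, 750) and no NaN.
rolled = {
    name: p.squeeze(0).unfold(1, win_size, stride).mean(dim=2)
    for name, p in padded.items()
}

for name, r in rolled.items():
    print(name, tuple(r.shape), r[1, 0].item())
```

Zero padding pulls the first few outputs toward zero, while replicate and reflect keep them near the level of the first samples; which is preferable depends on the signal.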

Tags: python, pandas, numpy, pytorch

Rachel Shalom
