Split Python sequence (time series/array) into subsequences with overlap

I need to extract all subsequences of a time series/array of a given window. For example:

>>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> window = 3
>>> subsequences(ts, window)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Naive methods that iterate over the sequence are of course expensive, for example:

def subsequences(ts, window):
    res = []
    for i in range(ts.size - window + 1):
        subts = ts[i:i+window]
        subts.reset_index(drop=True, inplace=True)
        subts.name = None
        res.append(subts)
    return pd.DataFrame(res)

I found a better way: take one shifted copy of the sequence per offset in the window, and split each shifted copy into rows with reshape. Performance is around 100x better, because the for loop iterates over the window size rather than the sequence size:

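def subsequences(ts, window):
    values = ts.values  # work on the underlying numpy array
    res = []
    for i in range(window):
        shifted = values[i:]          # sequence shifted by i positions
        n = shifted.size // window    # complete windows at this offset
        res.append(shifted[:n * window].reshape((n, window)))
    # rows come out grouped by offset, not in order of starting position
    return pd.DataFrame(np.concatenate(res, axis=0))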

I've seen that pandas includes several rolling functions in the pandas.stats.moment module, and I guess what they do is somehow similar to the subsequencing problem. Is there anything in that module, or anywhere else in pandas, that would make this more efficient?

Thank you!

UPDATE (SOLUTION):

Based on @elyase's answer, there is a slightly simpler implementation for this specific case. Let me write it down here and explain what it's doing:

def subsequences(ts, window):
    shape = (ts.size - window + 1, window)
    strides = ts.strides * 2
    return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)

Given the 1-D numpy array, we first compute the shape of the resulting array. There will be one row starting at each position of the array, except for the last few positions, where there aren't enough remaining elements to complete a window.

See in the first example of this description how the last number we start at is 7, because starting at 8 we can't create a window of three elements. So, the number of rows is the size minus the window plus one. The number of columns is simply the window.
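As a quick check of that row count, here is a minimal sketch. The helper name window_shape is hypothetical and introduced only for illustration:

```python
import numpy as np

def window_shape(size, window):
    """Shape of the 2-D array of overlapping windows (hypothetical helper)."""
    return (size - window + 1, window)

ts = np.arange(10)
shape = window_shape(ts.size, 3)
print(shape)  # (8, 3): rows start at positions 0..7, each row has 3 elements
```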

Next, the tricky part is telling how to fill the resulting array, with the shape we just defined.

To do this, we consider that the first element of the result will be the first element of the original array. Then we need to specify two values (as a tuple of two integers passed to the strides parameter). These values specify the steps we need to take in the original array (the 1-D one) to fill the result (the 2-D one).

Consider a different example, where we want to implement the np.reshape function, going from a 9-element 1-D array to a 3x3 array. The first element fills the first position, and the one to its right is the next one in the 1-D array, so we move 1 step. Then, the tricky part: to fill the first element of the second row, we need to take 3 steps, from the 0 to the 3, see:

>>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> new = np.array([[0, 1, 2],
...                 [3, 4, 5],
...                 [6, 7, 8]])

So, to reshape, the steps would be 1 when moving right and 3 when moving down; in NumPy's (row, column) order that is (3, 1). Our case, where the rows overlap, is actually simpler. When we move down to start a new row, we start at the next position in the 1-D array, so 1 step; and when we move right, we again get the next element, so 1 step as well. So, the steps would be (1, 1).
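To make the "steps" idea concrete, here is a small sketch that fills the 2-D result by hand, counting steps in elements and without using as_strided. The helper fill_by_steps is hypothetical and written only for illustration; it loops in Python and is not meant to be fast. The steps are given in (down, right) order.

```python
import numpy as np

def fill_by_steps(flat, shape, steps):
    """Build a 2-D array where out[i, j] = flat[i * steps[0] + j * steps[1]].

    Hypothetical helper: `steps` are counted in elements as (down, right),
    not in bytes.
    """
    rows, cols = shape
    out = np.empty(shape, dtype=flat.dtype)
    for i in range(rows):
        for j in range(cols):
            out[i, j] = flat[i * steps[0] + j * steps[1]]
    return out

original = np.arange(9)
# Plain reshape: 3 steps to go down one row, 1 step to go right.
reshaped = fill_by_steps(original, (3, 3), (3, 1))
# Overlapping windows: 1 step to go down one row, 1 step to go right.
overlapping = fill_by_steps(original, (7, 3), (1, 1))
```

The first call reproduces original.reshape(3, 3); the second gives the rows [0, 1, 2], [1, 2, 3], and so on up to [6, 7, 8].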

There is one last thing to note. The strides argument does not take the "steps" we used, but the corresponding number of bytes in memory. To get them, we can use the strides attribute of numpy arrays. It returns a tuple with the strides (steps in bytes), with one element per dimension. In our case we get a 1-element tuple, and we want it twice, hence the * 2.
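A short sketch of what those byte counts look like. The dtypes are fixed explicitly here (an assumption made for the example) so that the numbers do not depend on the platform:

```python
import numpy as np

a = np.arange(10, dtype=np.int64)
print(a.itemsize)     # 8: bytes per int64 element
print(a.strides)      # (8,): one step along the only axis is 8 bytes
print(a.strides * 2)  # (8, 8): tuple repetition, the (1, 1) steps in bytes

b = a.astype(np.int32)
print(b.strides * 2)  # (4, 4): same steps, smaller elements
```

Note that * 2 here repeats the tuple; it does not double the stride values.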

The np.lib.stride_tricks.as_strided function performs the filling using the described method without copying the data, which makes it quite efficient.
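As a side note that is not part of the original solution: NumPy 1.20 and later also ship numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of view and returns it read-only by default. The sketch below, assuming such a NumPy version is installed, compares it with the manual as_strided call and checks that no data is copied:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

ts = np.arange(10)
window = 3

# Manual view, as in the function above.
manual = as_strided(ts, shape=(ts.size - window + 1, window), strides=ts.strides * 2)
# Library helper (NumPy >= 1.20).
builtin = sliding_window_view(ts, window)

same_values = np.array_equal(manual, builtin)  # both contain the same windows
shares = np.shares_memory(manual, ts)          # the view reuses ts's buffer
read_only = not builtin.flags.writeable        # the helper's view rejects writes
print(same_values, shares, read_only)
```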

Finally, note that the function posted here assumes a 1-D input array (which is different from a 2-D array with a single row or column). Check the shape attribute of the input array: you should get something like (N,) and not (N, 1). This method would fail on the latter. Note that the method posted by @elyase handles two-dimensional input arrays (that's why this version is slightly simpler).
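One way to guard against the wrong input shape is sketched below. The name subsequences_checked and the error messages are hypothetical; this is just the function above with two explicit checks added:

```python
import numpy as np

def subsequences_checked(ts, window):
    """Variant of subsequences that rejects non 1-D input (hypothetical helper)."""
    arr = np.asarray(ts)
    if arr.ndim != 1:
        raise ValueError(f"expected a 1-D array, got shape {arr.shape}")
    if not 1 <= window <= arr.size:
        raise ValueError("window must be between 1 and the array length")
    shape = (arr.size - window + 1, window)
    return np.lib.stride_tricks.as_strided(arr, shape=shape, strides=arr.strides * 2)

column = np.arange(10).reshape(-1, 1)              # shape (10, 1): rejected as is
windows = subsequences_checked(column.ravel(), 3)  # ravel() gives shape (10,)
```

Flattening with ravel() first is enough when the second dimension has length 1.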

Marc Garcia asked Jan 09 '15 01:01


3 Answers

This is 34x faster than your fast version on my machine:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

>>> rolling_window(ts.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Credit goes to Erik Rigtorp.
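Because the function works along the last axis, it also accepts 2-D input. A minimal sketch with made-up data (the array two_series is only an illustration), repeating the function so the snippet runs on its own:

```python
import numpy as np

def rolling_window(a, window):
    # Same function as above, repeated so this snippet is self-contained.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

two_series = np.array([[0, 1, 2, 3, 4],
                       [10, 11, 12, 13, 14]])
windows = rolling_window(two_series, 3)
print(windows.shape)  # (2, 3, 3): 2 series, 3 windows each, window length 3
print(windows[1])     # the windows of the second series
```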

elyase answered Nov 10 '22 01:11


I'd like to note that PyTorch offers a single function for this problem. When working with Torch tensors it is as memory efficient as the current best solution, but it is much simpler and more general (for example, it works with multiple dimensions):

# Import packages
import torch
import pandas as pd
# Create array and set window size
ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
window = 3
# Create subsequences by converting to/from a Tensor
ts_torch = torch.from_numpy(ts.values)  # convert to torch Tensor
ss_torch = ts_torch.unfold(0, window, 1) # create subsequences in-memory
ss_numpy = ss_torch.numpy() # convert Tensor back to numpy (obviously now needs more memory)
# Or just in a single line:
ss_numpy = torch.from_numpy(ts.values).unfold(0, window, 1).numpy()

The key piece is the unfold function; see the PyTorch docs for a detailed explanation. Converting back to numpy may not be required if you're OK working directly with PyTorch tensors; in that case the solution is just as memory efficient. In my use case, I found it easier to first create the subsequences (and do other preprocessing) using Torch tensors, and to call .numpy() on these tensors to convert to numpy as and when needed.

Olivier answered Nov 10 '22 00:11


It is worth noting that stride tricks can have unintended consequences when working on the transformed array. They are efficient because they only change how the existing memory is indexed, without creating a copy of the original array. If you update any value in the returned array, it changes the values in the original array, and vice versa.

l = np.asarray([1,2,3,4,5,6,7,8,9])
_ = rolling_window(l, 3)
print(_)
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

_[0,1] = 1000
print(_)
array([[   1, 1000,    3],
       [1000,    3,    4],
       [   3,    4,    5],
       [   4,    5,    6],
       [   5,    6,    7],
       [   6,    7,    8],
       [   7,    8,    9]])

# create new matrix from original array
xx = pd.DataFrame(rolling_window(l, 3))
# the updated values are still updated
print(xx)
      0     1  2
0     1  1000  3
1  1000     3  4
2     3     4  5
3     4     5  6
4     5     6  7
5     6     7  8
6     7     8  9

# change values in xx changes values in _ and l
xx.loc[0,1] = 100
print(_)
print(l)
[[  1 100   3]
 [100   3   4]
 [  3   4   5]
 [  4   5   6]
 [  5   6   7]
 [  6   7   8]
 [  7   8   9]]
[  1 100   3   4   5   6   7   8   9]

# make a dataframe copy to avoid unintended side effects
new = xx.copy()
# changing values in new won't affect l, _, or xx

Any value that is changed in xx, _ or l shows up in the other variables because they all share the same data in memory.

See numpy docs for more detail: numpy.lib.stride_tricks.as_strided
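Two ways to avoid these side effects at the numpy level are sketched below, assuming a NumPy version where as_strided accepts the writeable argument (1.12 or later): ask for a read-only view, or take an explicit copy.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

l = np.arange(1, 10)
window = 3
shape = (l.size - window + 1, window)

# Option 1: a read-only view. Still no copy, but writes are refused.
view = as_strided(l, shape=shape, strides=l.strides * 2, writeable=False)
try:
    view[0, 1] = 1000
except ValueError as err:
    print("write refused:", err)

# Option 2: an independent copy. Uses more memory, but is safe to modify.
independent = as_strided(l, shape=shape, strides=l.strides * 2).copy()
independent[0, 1] = 1000
print(l[1])  # still 2: the original array is untouched
```

The copy stores every window separately, so it gives up the memory savings of the view.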

jkm answered Nov 09 '22 23:11