PyTorch: Dataloader for time series task

I have a Pandas dataframe with n rows and k columns loaded into memory. I would like to get batches for a forecasting task where the first training example of a batch has shape (q, k), with q being the number of rows taken from the original dataframe (e.g. rows 0:128). The next example should be rows 128:256 across all k columns, and so on. Ultimately, one batch should have the shape (32, q, k), with 32 being the batch size.

Since TensorDataset from data_utils does not work here, I am wondering what the best approach would be. I tried np.array_split() to get the number of possible splits of q rows as the first dimension, in order to write a custom DataLoader, but reshaping is then not guaranteed to work since not all arrays have the same shape.

Here is a minimal example to make this clearer. In this case, the batch size is 3 and q is 2:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0,30).reshape(10,3),columns=['A','B','C'])

The dataset:

    A   B   C
0   0   1   2
1   3   4   5
2   6   7   8
3   9   10  11
4   12  13  14
5   15  16  17
6   18  19  20
7   21  22  23
8   24  25  26
9   27  28  29

The first batch in this case should have the shape (3,2,3) and look like:

array([[[ 0.,  1.,  2.],
        [ 3.,  4.,  5.]],

       [[ 3.,  4.,  5.],
        [ 6.,  7.,  8.]],

       [[ 6.,  7.,  8.],
        [ 9., 10., 11.]]])

asked Sep 11 '19 16:09 by beginneR

2 Answers

You can write your own analog of TensorDataset. To do this, inherit from the Dataset class:

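from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data_frame, q):
        # Keep the frame as a NumPy array of shape (n, k)
        self.data = data_frame.values
        self.q = q

    def __len__(self):
        # Number of complete, non-overlapping windows of q rows
        return self.data.shape[0] // self.q

    def __getitem__(self, index):
        # Window `index` covers rows index*q .. (index+1)*q - 1
        return self.data[index * self.q: (index+1) * self.q]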
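
Note that MyDataset yields non-overlapping windows (rows 0:q, q:2q, ...), while the expected first batch in the question uses windows that advance by one row at a time. Below is a minimal sketch of a stride-1 variant, run on the question's example dataframe. The class name SlidingWindowDataset and the float32 cast are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader


class SlidingWindowDataset(Dataset):
    """Hypothetical variant of MyDataset: window i covers rows i .. i+q-1."""

    def __init__(self, data_frame, q):
        # Cast to float32 so the default collate function yields float tensors.
        self.data = data_frame.to_numpy(dtype=np.float32)
        self.q = q

    def __len__(self):
        # Number of full windows of q rows when advancing one row at a time.
        return max(self.data.shape[0] - self.q + 1, 0)

    def __getitem__(self, index):
        return self.data[index:index + self.q]


# The example dataframe from the question: 10 rows, 3 columns.
df = pd.DataFrame(data=np.arange(0, 30).reshape(10, 3), columns=['A', 'B', 'C'])
loader = DataLoader(SlidingWindowDataset(df, q=2), batch_size=3, shuffle=False)

first_batch = next(iter(loader))
print(first_batch.shape)  # torch.Size([3, 2, 3])
print(first_batch)
```

With shuffle=False the first batch contains the windows starting at rows 0, 1 and 2, which is the layout shown in the question.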

answered Sep 30 '22 18:09 by antoleb

I ended up writing a custom dataset as well, though it's a bit different from the answer above:

import torch

class TimeseriesDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, seq_len=1):
        self.X = X
        self.y = y
        self.seq_len = seq_len

    def __len__(self):
        # Number of full windows of seq_len consecutive steps
        return len(self.X) - (self.seq_len - 1)

    def __getitem__(self, index):
        # Window of seq_len steps, paired with the target at the window's last step
        return (self.X[index:index+self.seq_len], self.y[index+self.seq_len-1])

Usage looks like this:

train_dataset = TimeseriesDataset(X_lstm, y_lstm, seq_len=4)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=3, shuffle=False)

for i, d in enumerate(train_loader):
    print(i, d[0].shape, d[1].shape)

>>>
# shape: tuple((batch_size, seq_len, n_features), (batch_size))
0 torch.Size([3, 4, 2]) torch.Size([3])
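
The snippet above does not define X_lstm and y_lstm. As a rough, self-contained sketch, the version below substitutes small hypothetical tensors (10 time steps, 2 features, chosen only to match the printed shapes) and collects the shape of every batch.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TimeseriesDataset(Dataset):
    def __init__(self, X, y, seq_len=1):
        self.X = X
        self.y = y
        self.seq_len = seq_len

    def __len__(self):
        # Number of full windows of seq_len consecutive steps.
        return len(self.X) - (self.seq_len - 1)

    def __getitem__(self, index):
        # Window of seq_len steps plus the target at the window's last step.
        return self.X[index:index + self.seq_len], self.y[index + self.seq_len - 1]


# Hypothetical stand-ins for X_lstm and y_lstm: 10 time steps, 2 features.
X_lstm = torch.arange(20, dtype=torch.float32).reshape(10, 2)
y_lstm = torch.arange(10, dtype=torch.float32)

train_dataset = TimeseriesDataset(X_lstm, y_lstm, seq_len=4)
train_loader = DataLoader(train_dataset, batch_size=3, shuffle=False)

shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in train_loader]
print(len(train_dataset))  # 7 windows
print(shapes[0])           # ((3, 4, 2), (3,))
print(shapes[-1])          # ((1, 4, 2), (1,))
```

Seven windows do not divide evenly by a batch size of 3, so the last batch holds a single example. If every batch must have the same size, passing drop_last=True to DataLoader discards that final partial batch.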

answered Sep 30 '22 17:09 by Eugene Tartakovsky