I have a Pandas dataframe with n rows and k columns loaded into memory. I would like to get batches for a forecasting task where the first training example of a batch should have shape (q, k), with q referring to the number of rows from the original dataframe (e.g. 0:128). The next example should be (128:256, k) and so on. So, ultimately, one batch should have the shape (32, q, k), with 32 corresponding to the batch size.
Since TensorDataset from data_utils does not work here, I am wondering what the best way would be. I tried to use np.array_split() to get the number of possible splits of q values as the first dimension, in order to write a custom DataLoader, but then reshaping is not guaranteed to work since not all arrays have the same shape.
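To illustrate the np.array_split() problem described above, here is a small sketch. The 10x3 array and the choice of three splits are assumptions made purely for illustration; the point is only that the resulting pieces need not share a shape, so they cannot be stacked into one regular array.

```python
import numpy as np

# Assumed toy data: 10 rows, 3 columns.
data = np.arange(0, 30).reshape(10, 3)

# 10 rows cannot be divided evenly into 3 parts, so np.array_split
# makes the first part one row longer than the others.
parts = np.array_split(data, 3)
shapes = [p.shape for p in parts]
print(shapes)  # [(4, 3), (3, 3), (3, 3)]
```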
Here is a minimal example to make it clearer. In this case, the batch size is 3 and q is 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0,30).reshape(10,3),columns=['A','B','C'])
The dataset:
    A   B   C
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
6  18  19  20
7  21  22  23
8  24  25  26
9  27  28  29
The first batch in this case should have the shape (3,2,3) and look like:
array([[[ 0.,  1.,  2.],
        [ 3.,  4.,  5.]],

       [[ 3.,  4.,  5.],
        [ 6.,  7.,  8.]],

       [[ 6.,  7.,  8.],
        [ 9., 10., 11.]]])
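You can write your own analog of TensorDataset. To do this you need to inherit from the Dataset class. Note that the version below returns non-overlapping windows of q rows each (rows 0:q, then q:2q, and so on).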
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data_frame, q):
        self.data = data_frame.values
        self.q = q

    def __len__(self):
        # Number of complete windows of q rows; leftover rows are dropped
        return self.data.shape[0] // self.q

    def __getitem__(self, index):
        # Window number `index`: rows index*q up to (index+1)*q
        return self.data[index * self.q: (index + 1) * self.q]
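As a rough usage sketch, this is how the dataset above could be plugged into a DataLoader with the 10x3 dataframe from the question. The class is repeated only so the snippet runs on its own; the values q=2 and batch_size=3 are taken from the question's example.

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    """Same dataset as in the answer: non-overlapping windows of q rows."""

    def __init__(self, data_frame, q):
        self.data = data_frame.values
        self.q = q

    def __len__(self):
        return self.data.shape[0] // self.q

    def __getitem__(self, index):
        return self.data[index * self.q: (index + 1) * self.q]


df = pd.DataFrame(np.arange(0, 30).reshape(10, 3), columns=["A", "B", "C"])

# The default collate function stacks the NumPy windows into one tensor.
loader = DataLoader(MyDataset(df, q=2), batch_size=3, shuffle=False)
first_batch = next(iter(loader))
print(first_batch.shape)  # torch.Size([3, 2, 3])
```

With 10 rows and q=2 there are 5 windows, so the loader yields one batch of 3 windows and one of 2. The first batch covers rows 0:2, 2:4 and 4:6, which differs from the overlapping windows shown in the question.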
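I ended up writing a custom dataset as well, though it's a bit different from the answer above: it slides the window forward one row at a time, so consecutive examples overlap, and it returns a target along with each window.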
import torch

class TimeseriesDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, seq_len=1):
        self.X = X
        self.y = y
        self.seq_len = seq_len

    def __len__(self):
        # One example per possible window start position
        return len(self.X) - (self.seq_len - 1)

    def __getitem__(self, index):
        # Window of seq_len rows, paired with the target at the window's last row
        return (self.X[index:index + self.seq_len], self.y[index + self.seq_len - 1])
And the usage looks like this:
train_dataset = TimeseriesDataset(X_lstm, y_lstm, seq_len=4)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=3, shuffle=False)

for i, d in enumerate(train_loader):
    print(i, d[0].shape, d[1].shape)
>>>
# shape: tuple((batch_size, seq_len, n_features), (batch_size))
0 torch.Size([3, 4, 2]) torch.Size([3])
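For completeness, here is a sketch applying this sliding-window dataset to the 10x3 dataframe from the question. X_lstm and y_lstm are not defined in the answer, so the tensors X and y below are stand-ins; in particular, using column C as the target is purely an assumption for illustration.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class TimeseriesDataset(Dataset):
    """Same sliding-window dataset as in the answer above."""

    def __init__(self, X, y, seq_len=1):
        self.X = X
        self.y = y
        self.seq_len = seq_len

    def __len__(self):
        return len(self.X) - (self.seq_len - 1)

    def __getitem__(self, index):
        return self.X[index:index + self.seq_len], self.y[index + self.seq_len - 1]


df = pd.DataFrame(np.arange(0, 30).reshape(10, 3), columns=["A", "B", "C"])
X = torch.tensor(df.values, dtype=torch.float32)
# Assumption: column C is the forecasting target.
y = torch.tensor(df["C"].values, dtype=torch.float32)

loader = DataLoader(TimeseriesDataset(X, y, seq_len=2), batch_size=3, shuffle=False)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([3, 2, 3]) torch.Size([3])
```

With seq_len=2 and batch_size=3, the first batch of inputs contains rows 0:2, 1:3 and 2:4, which matches the (3, 2, 3) array shown in the question.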