I have a dataset that looks like the one below. The first item in each row is the user ID, followed by the set of items clicked by that user.
    0 24104 27359 6684
    0 24104 27359
    1 16742 31529 31485
    1 16742 31529
    2 6579 19316 13091 7181 6579 19316 13091
    2 6579 19316 13091 7181 6579 19316
    2 6579 19316 13091 7181 6579 19316 13091 6579
    2 6579 19316 13091 7181 6579
    4 19577 21608
    4 19577 21608
    4 19577 21608 18373
    5 3541 9529
    5 3541 9529
    6 6832 19218 14144
    6 6832 19218
    7 9751 23424 25067 12606 26245 23083 12606
I define a custom dataset to handle my click log data.
    import torch.utils.data as data

    class ClickLogDataset(data.Dataset):
        def __init__(self, data_path):
            self.data_path = data_path
            self.uids = []
            self.streams = []

            with open(self.data_path, 'r') as fdata:
                for row in fdata:
                    row = row.strip('\n').split('\t')
                    self.uids.append(int(row[0]))
                    self.streams.append(list(map(int, row[1:])))

        def __len__(self):
            return len(self.uids)

        def __getitem__(self, idx):
            uid, stream = self.uids[idx], self.streams[idx]
            return uid, stream
Then I use a DataLoader to retrieve mini batches from the data for training.
    from torch.utils.data.dataloader import DataLoader

    clicklog_dataset = ClickLogDataset(data_path)
    clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

    for uid_batch, stream_batch in clicklog_data_loader:
        print(uid_batch)
        print(stream_batch)
The code above returns something different from what I expected. I want stream_batch to be a 2D integer tensor with 16 rows. However, what I get is a list of 1D tensors of length 16, and the list has only one element, as shown below. Why is that?
    # stream_batch
    [tensor([24104, 24104, 16742, 16742, 6579, 6579, 6579, 6579, 19577, 19577, 19577, 3541, 3541, 6832, 6832, 9751])]
Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.
Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
PyTorch DataLoader batch size: the batch size is the number of samples processed before the model is updated. It is not the total number of samples in the training data.
So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter, which is used to transform a list of samples into a batch. By default, when each sample is a list, the samples are collated position by position, which gives a list of 1D tensors (one per position) rather than a single 2D tensor. You can write your own collate_fn, which for instance 0-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
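As a minimal sketch of the zero-padding option, assuming each sample is a (uid, stream) pair like the ones returned by the dataset in the question and that 0 is free to use as the padding value: the name pad_collate and the toy samples list are made up for illustration, while pad_sequence and the collate_fn argument are existing PyTorch APIs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Collate (uid, stream) pairs into tensors, zero-padding the streams.

    Hypothetical helper: assumes 0 can serve as the padding value.
    """
    uids, streams = zip(*batch)
    # Keep the true lengths so padding can be told apart from data later.
    lengths = torch.tensor([len(s) for s in streams])
    # batch_first=True gives shape (batch_size, max_len).
    padded = pad_sequence(
        [torch.tensor(s, dtype=torch.long) for s in streams],
        batch_first=True,
        padding_value=0,
    )
    return torch.tensor(uids), padded, lengths


# Toy samples in the same (uid, stream) form as in the question.
samples = [(0, [24104, 27359, 6684]), (0, [24104, 27359]), (4, [19577, 21608])]
loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)

uid_batch, stream_batch, length_batch = next(iter(loader))
print(stream_batch.shape)  # torch.Size([3, 3])
print(stream_batch)
```

With this, stream_batch is a single 2D integer tensor with one row per sample, and length_batch records how many entries in each row are real data.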
This is the way I do it:
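    import torch

    # Assumed definition: the original snippet relies on a global `device`.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def collate_fn_padd(batch):
        '''
        Pads a batch of variable-length sequences.

        Note: it converts things to tensors manually here, since the ToTensor
        transform assumes it takes in images rather than arbitrary tensors.
        '''
        # get sequence lengths (each element must expose .shape, e.g. an array or tensor)
        lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
        # pad; by default pad_sequence returns shape (max_len, batch_size)
        batch = [torch.Tensor(t).to(device) for t in batch]
        batch = torch.nn.utils.rnn.pad_sequence(batch)
        # compute mask (assumes 0 never occurs as a real value)
        mask = (batch != 0).to(device)
        return batch, lengths, mask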
Then I pass that to the DataLoader class as its collate_fn.
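Two details of a padding function like this are worth checking: pad_sequence returns a (max_len, batch_size) tensor unless batch_first=True is passed, and a mask computed as batch != 0 also hides any real zeros in the data. The sketch below, on made-up sequences, compares the two layouts and builds the mask from the lengths instead. It is one possible variant, not the method used in the answer above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up sequences; note the genuine 0 inside the first one.
seqs = [torch.tensor([5, 0, 7]), torch.tensor([3, 9])]
lengths = torch.tensor([len(s) for s in seqs])

time_major = pad_sequence(seqs)                     # shape (max_len, batch_size)
batch_major = pad_sequence(seqs, batch_first=True)  # shape (batch_size, max_len)

# Value-based mask: treats the genuine 0 as padding.
value_mask = batch_major != 0

# Length-based mask: position j of row i is real data iff j < lengths[i].
positions = torch.arange(batch_major.size(1))
length_mask = positions.unsqueeze(0) < lengths.unsqueeze(1)

print(time_major.shape, batch_major.shape)
print(value_mask)
print(length_mask)
```

If 0 can never appear as a real item ID, the two masks agree and the simpler value-based one is enough.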
There seems to be a giant list of different posts on this in the PyTorch forum. Let me link to all of them. They all have answers of their own and discussions. It doesn't seem to me that there is one "standard way to do it", but if there is one from an authoritative reference, please share it.
It would be nice if the ideal answer mentioned things of that sort.
List:
bucketing: https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284
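For readers unfamiliar with the term, bucketing groups samples of similar length into the same batch so that less padding is needed. The following is a rough sketch of the general idea in plain Python, not the approach from the linked thread. The function name bucket_batches and the example lengths are invented for illustration.

```python
import random


def bucket_batches(lengths, batch_size, shuffle=True, seed=0):
    """Return a list of index batches that group samples of similar length.

    Hypothetical helper: sorts sample indices by length, cuts the sorted
    order into consecutive chunks, and optionally shuffles the chunks.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        # Shuffle whole batches so each batch keeps similar-length samples.
        random.Random(seed).shuffle(batches)
    return batches


# Made-up sequence lengths for eight samples.
lengths = [3, 2, 3, 2, 7, 6, 8, 5]
batches = bucket_batches(lengths, batch_size=2, shuffle=False)
print(batches)  # [[1, 3], [0, 2], [7, 5], [4, 6]]
```

A list of index batches like this can be given to DataLoader through its batch_sampler argument (leaving batch_size and shuffle at their defaults), typically together with a padding collate_fn.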