
How does Pytorch Dataloader handle variable size data?


I have a dataset that looks like the one below. On each line, the first item is the user id, followed by the set of items clicked by that user.

0   24104   27359   6684
0   24104   27359
1   16742   31529   31485
1   16742   31529
2   6579    19316   13091   7181    6579    19316   13091
2   6579    19316   13091   7181    6579    19316
2   6579    19316   13091   7181    6579    19316   13091   6579
2   6579    19316   13091   7181    6579
4   19577   21608
4   19577   21608
4   19577   21608   18373
5   3541    9529
5   3541    9529
6   6832    19218   14144
6   6832    19218
7   9751    23424   25067   12606   26245   23083   12606

I define a custom dataset to handle my click log data.

import torch.utils.data as data

class ClickLogDataset(data.Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        self.uids = []
        self.streams = []

        with open(self.data_path, 'r') as fdata:
            for row in fdata:
                row = row.strip('\n').split('\t')
                self.uids.append(int(row[0]))
                self.streams.append(list(map(int, row[1:])))

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        uid, stream = self.uids[idx], self.streams[idx]
        return uid, stream

Then I use a DataLoader to retrieve mini batches from the data for training.

from torch.utils.data.dataloader import DataLoader

clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

for uid_batch, stream_batch in clicklog_data_loader:
    print(uid_batch)
    print(stream_batch)

The code above does not return what I expected. I want stream_batch to be a 2D integer tensor with 16 rows, one per sample. Instead, what I get is a list containing a single 1D tensor of length 16, as shown below. Why is that?

#stream_batch
[tensor([24104, 24104, 16742, 16742,  6579,  6579,  6579,  6579, 19577, 19577,
         19577,  3541,  3541,  6832,  6832,  9751])]
Asked by Trung Le on Mar 07 '19


People also ask

How does PyTorch data loader work?

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

What is the difference between a PyTorch dataset and a PyTorch DataLoader?

Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

What is batch size in DataLoader PyTorch?

The batch_size argument tells the DataLoader how many samples from the dataset to group together into each batch it yields; during training this is typically the number of samples the model processes before its parameters are updated.


2 Answers

So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch. By default it uses the built-in collate function, which does not pad variable-length lists but instead zips them element-wise, which is why you get a list of 1D tensors rather than one 2D tensor. You can write your own collate_fn which, for instance, 0-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
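For instance, here is a minimal sketch of such a padding collate_fn for the click-log dataset from the question. The helper name pad_collate and the choice of 0 as the padding value are illustrative assumptions, not part of the original answer:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (uid, stream) pairs as returned by ClickLogDataset.__getitem__
    uids = torch.tensor([uid for uid, _ in batch])
    streams = [torch.tensor(stream) for _, stream in batch]
    lengths = torch.tensor([len(s) for s in streams])
    # pad every stream with 0 up to the length of the longest stream in this batch
    padded = pad_sequence(streams, batch_first=True, padding_value=0)
    return uids, padded, lengths

clicklog_data_loader = DataLoader(clicklog_dataset, batch_size=16, collate_fn=pad_collate)
for uid_batch, stream_batch, length_batch in clicklog_data_loader:
    print(stream_batch.shape)  # (batch size, longest stream in the batch)

With this, stream_batch becomes the 2D integer tensor asked for in the question, and length_batch records how much of each row is real data versus padding.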

Answered by Jatentaki on Nov 11 '22


This is the way I do it:

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.

    note: it converts things to tensors manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    '''
    ## get sequence lengths
    lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
    ## pad
    batch = [torch.Tensor(t).to(device) for t in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch)
    ## compute mask
    mask = (batch != 0).to(device)
    return batch, lengths, mask

Then I pass that to the DataLoader as its collate_fn argument.
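For completeness, here is a small self-contained sketch of how this collate function could be wired up. The ToySequences dataset, the CPU device and the batch size are assumptions for illustration only; note that collate_fn_padd expects each sample to already be a tensor, not a (uid, stream) pair as in the question:

import torch
from torch.utils.data import DataLoader, Dataset

device = torch.device('cpu')  # assumed device; collate_fn_padd above moves everything to `device`

class ToySequences(Dataset):
    # hypothetical dataset whose items are 1D tensors of varying length
    def __init__(self):
        self.seqs = [torch.arange(1, n + 1) for n in (3, 5, 2, 4)]
    def __len__(self):
        return len(self.seqs)
    def __getitem__(self, idx):
        return self.seqs[idx]

loader = DataLoader(ToySequences(), batch_size=4, collate_fn=collate_fn_padd)
for padded, lengths, mask in loader:
    # pad_sequence defaults to batch_first=False, so padded has shape (max_len, batch_size)
    print(padded.shape, lengths.tolist(), mask.shape)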


There are quite a few posts about this in the PyTorch forums; let me link to all of them. They each have their own answers and discussions. It doesn't seem to me that there is one "standard way to do it", but if there is an authoritative reference, please share it.

It would be nice if the ideal answer mentioned:

  • efficiency, e.g. whether to do the processing on the GPU with torch inside the collate function vs. with numpy

and things of that sort.

List:

  • https://discuss.pytorch.org/t/how-to-create-batches-of-a-list-of-varying-dimension-tensors/50773
  • https://discuss.pytorch.org/t/how-to-create-a-dataloader-with-variable-size-input/8278
  • https://discuss.pytorch.org/t/using-variable-sized-input-is-padding-required/18131
  • https://discuss.pytorch.org/t/dataloader-for-various-length-of-data/6418
  • https://discuss.pytorch.org/t/how-to-do-padding-based-on-lengths/24442

Bucketing:

  • https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284
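For context, here is a minimal sketch of the bucketing idea from that last thread: group samples of similar length so that each batch needs little padding. The BucketBatchSampler name and its simple sort-then-chunk strategy are my own illustration, not taken from the linked posts, and it reuses the clicklog_dataset and the pad_collate sketch from earlier:

import torch
from torch.utils.data import DataLoader, Sampler

class BucketBatchSampler(Sampler):
    # hypothetical sampler: sort indices by sequence length, then cut them into batches,
    # so sequences of similar length land in the same batch and padding stays small
    def __init__(self, lengths, batch_size):
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    def __iter__(self):
        return iter(self.batches)
    def __len__(self):
        return len(self.batches)

lengths = [len(s) for s in clicklog_dataset.streams]
loader = DataLoader(clicklog_dataset,
                    batch_sampler=BucketBatchSampler(lengths, batch_size=16),
                    collate_fn=pad_collate)  # pad_collate is the padding collate sketched above

Shuffling the buckets each epoch, rather than always iterating them in sorted order, is a common refinement, but it is omitted here to keep the sketch short.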

Answered by Charlie Parker on Nov 11 '22