
How does Pytorch Dataloader handle variable size data?


I have a dataset that looks like the one below. On each line, the first item is the user id, followed by the set of items clicked by that user.

0   24104   27359   6684
0   24104   27359
1   16742   31529   31485
1   16742   31529
2   6579    19316   13091   7181    6579    19316   13091
2   6579    19316   13091   7181    6579    19316
2   6579    19316   13091   7181    6579    19316   13091   6579
2   6579    19316   13091   7181    6579
4   19577   21608
4   19577   21608
4   19577   21608   18373
5   3541    9529
5   3541    9529
6   6832    19218   14144
6   6832    19218
7   9751    23424   25067   12606   26245   23083   12606

I define a custom dataset to handle my click log data.

import torch.utils.data as data

class ClickLogDataset(data.Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        self.uids = []
        self.streams = []

        with open(self.data_path, 'r') as fdata:
            for row in fdata:
                row = row.strip('\n').split('\t')
                self.uids.append(int(row[0]))
                self.streams.append(list(map(int, row[1:])))

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        uid, stream = self.uids[idx], self.streams[idx]
        return uid, stream

Then I use a DataLoader to retrieve mini batches from the data for training.

from torch.utils.data.dataloader import DataLoader

clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

for uid_batch, stream_batch in clicklog_data_loader:
    print(uid_batch)
    print(stream_batch)

The code above does not return what I expected. I want stream_batch to be a 2D integer tensor with 16 rows, one per sample. Instead, what I get is a list containing a single 1D tensor of length 16, as shown below. Why is that?

#stream_batch
[tensor([24104, 24104, 16742, 16742,  6579,  6579,  6579,  6579, 19577, 19577,
         19577,  3541,  3541,  6832,  6832,  9751])]
Asked by Trung Le on Mar 07 '19


People also ask

How does PyTorch data loader work?

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

What is the difference between a PyTorch dataset and a PyTorch DataLoader?

Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

What is batch size in DataLoader PyTorch?

The batch_size argument tells the DataLoader how many samples from the dataset to group together into each batch it yields; during training this is typically the number of samples the model processes before its parameters are updated.


2 Answers

So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch. By default it uses the built-in collate function, which does not pad variable-length lists but instead zips them element-wise, which is why you get a list of 1D tensors rather than one 2D tensor. You can write your own collate_fn which, for instance, 0-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
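For instance, here is a minimal sketch of such a padding collate_fn for the click-log dataset from the question. The helper name pad_collate and the choice of 0 as the padding value are illustrative assumptions, not part of the original answer:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (uid, stream) pairs as returned by ClickLogDataset.__getitem__
    uids = torch.tensor([uid for uid, _ in batch])
    streams = [torch.tensor(stream) for _, stream in batch]
    lengths = torch.tensor([len(s) for s in streams])
    # pad every stream with 0 up to the length of the longest stream in this batch
    padded = pad_sequence(streams, batch_first=True, padding_value=0)
    return uids, padded, lengths

clicklog_data_loader = DataLoader(clicklog_dataset, batch_size=16, collate_fn=pad_collate)
for uid_batch, stream_batch, length_batch in clicklog_data_loader:
    print(stream_batch.shape)  # (batch size, longest stream in the batch)

With this, stream_batch becomes the 2D integer tensor asked for in the question, and length_batch records how much of each row is real data versus padding.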

Answered by Jatentaki on Nov 11 '22


This is the way I do it:

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.

    note: it converts things to tensors manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    '''
    ## get sequence lengths
    lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
    ## pad
    batch = [torch.Tensor(t).to(device) for t in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch)
    ## compute mask
    mask = (batch != 0).to(device)
    return batch, lengths, mask

Then I pass that to the DataLoader as its collate_fn argument.
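For completeness, here is a small self-contained sketch of how this collate function could be wired up. The ToySequences dataset, the CPU device and the batch size are assumptions for illustration only; note that collate_fn_padd expects each sample to already be a tensor, not a (uid, stream) pair as in the question:

import torch
from torch.utils.data import DataLoader, Dataset

device = torch.device('cpu')  # assumed device; collate_fn_padd above moves everything to `device`

class ToySequences(Dataset):
    # hypothetical dataset whose items are 1D tensors of varying length
    def __init__(self):
        self.seqs = [torch.arange(1, n + 1) for n in (3, 5, 2, 4)]
    def __len__(self):
        return len(self.seqs)
    def __getitem__(self, idx):
        return self.seqs[idx]

loader = DataLoader(ToySequences(), batch_size=4, collate_fn=collate_fn_padd)
for padded, lengths, mask in loader:
    # pad_sequence defaults to batch_first=False, so padded has shape (max_len, batch_size)
    print(padded.shape, lengths.tolist(), mask.shape)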


There are quite a few posts about this in the PyTorch forums; let me link to all of them. They each have their own answers and discussions. It doesn't seem to me that there is one "standard way to do it", but if there is an authoritative reference, please share it.

It would be nice if the ideal answer mentioned:

  • efficiency, e.g. whether to do the processing on the GPU with torch inside the collate function vs. with numpy

and things of that sort.

List:

  • https://discuss.pytorch.org/t/how-to-create-batches-of-a-list-of-varying-dimension-tensors/50773
  • https://discuss.pytorch.org/t/how-to-create-a-dataloader-with-variable-size-input/8278
  • https://discuss.pytorch.org/t/using-variable-sized-input-is-padding-required/18131
  • https://discuss.pytorch.org/t/dataloader-for-various-length-of-data/6418
  • https://discuss.pytorch.org/t/how-to-do-padding-based-on-lengths/24442

Bucketing:

  • https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284
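For context, here is a minimal sketch of the bucketing idea from that last thread: group samples of similar length so that each batch needs little padding. The BucketBatchSampler name and its simple sort-then-chunk strategy are my own illustration, not taken from the linked posts, and it reuses the clicklog_dataset and the pad_collate sketch from earlier:

import torch
from torch.utils.data import DataLoader, Sampler

class BucketBatchSampler(Sampler):
    # hypothetical sampler: sort indices by sequence length, then cut them into batches,
    # so sequences of similar length land in the same batch and padding stays small
    def __init__(self, lengths, batch_size):
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    def __iter__(self):
        return iter(self.batches)
    def __len__(self):
        return len(self.batches)

lengths = [len(s) for s in clicklog_dataset.streams]
loader = DataLoader(clicklog_dataset,
                    batch_sampler=BucketBatchSampler(lengths, batch_size=16),
                    collate_fn=pad_collate)  # pad_collate is the padding collate sketched above

Shuffling the buckets each epoch, rather than always iterating them in sorted order, is a common refinement, but it is omitted here to keep the sketch short.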

Answered by Charlie Parker on Nov 11 '22