I'm creating a custom dataset for NLP-related tasks.
In the PyTorch custom dataset tutorial, we see that the __getitem__()
method leaves room for a transform before it returns a sample:
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)

        ### SOME DATA MANIPULATION HERE ###

        sample = {'image': image, 'landmarks': landmarks}
        if self.transform:
            sample = self.transform(sample)

        return sample
However, these lines:

    if torch.is_tensor(idx):
        idx = idx.tolist()

imply that multiple items can be retrieved at a time, which leaves me wondering:
How does that transform work on multiple items? Take the custom transforms in the tutorial for example. They do not look like they could be applied to a batch of samples in a single call.
Related, how does a DataLoader retrieve a batch of multiple samples in parallel and apply said transform if the transform can only be applied to a single sample?
From the PyTorch documentation: a DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning. A Dataset stores the samples and their corresponding labels, and a DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
Creating a PyTorch Dataset and managing it with a DataLoader keeps your data manageable and helps to simplify your machine learning pipeline: a Dataset stores all your data, and a DataLoader can be used to iterate through the data, manage batches, transform the data, and much more.
Batch size is defined as the number of samples processed before the model is updated. It is usually much smaller than the total number of samples in the training data; the DataLoader splits the dataset into batches of that size.
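To make the arithmetic concrete, here is a small sketch (the sample count and batch size are made-up numbers, not anything from the question) of how many batches a DataLoader emits per epoch:

```python
import math

# Hypothetical numbers: 1,000 training samples, batch size 32.
num_samples = 1000
batch_size = 32

# A DataLoader yields ceil(num_samples / batch_size) batches per epoch;
# the final batch is smaller unless you pass drop_last=True.
num_batches = math.ceil(num_samples / batch_size)
print(num_batches)  # 31 full batches of 32, plus a final batch of 8 -> 32
```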
How does that transform work on multiple items? They work on multiple items through use of the data loader. By using transforms, you are specifying what should happen to a single emission of data (e.g., batch_size=1). The data loader takes your specified batch_size and makes n calls to the __getitem__ method in the torch dataset, applying the transform to each sample sent into training/validation. It then collates the n samples into the batch emitted by the data loader.
Related, how does a DataLoader retrieve a batch of multiple samples in parallel and apply said transform if the transform can only be applied to a single sample? Hopefully the above makes sense to you. Parallelization is done by the torch dataset class and the data loader, where you specify num_workers. Torch will pickle the dataset and spread it across the workers.
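The mechanism above can be sketched in plain Python, with a toy map-style dataset and a stand-in for the default collate step (the class and function names here are illustrative, not part of the PyTorch API; torch is omitted so the sketch is self-contained):

```python
# Minimal sketch of what a DataLoader does per batch: it calls
# __getitem__ once per sample (applying the transform each time),
# then collates the results into one batch.

class ToyDataset:
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]              # fetch ONE sample
        if self.transform:
            sample = self.transform(sample)  # transform applied per sample
        return sample

def collate(samples):
    # PyTorch's default_collate stacks n samples into one batch tensor;
    # here we just gather them into a list.
    return list(samples)

ds = ToyDataset([1, 2, 3, 4], transform=lambda x: x * 10)
batch_size = 2

# The loader makes batch_size calls to __getitem__, then collates:
batch = collate(ds[i] for i in range(batch_size))
print(batch)  # [10, 20]
```

Note that the transform never sees more than one sample at a time; batching happens afterwards, in the collate step.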
From the documentation of transforms in torchvision:
All transformations accept PIL Image, Tensor Image or batch of Tensor Images as input. Tensor Image is a tensor with (C, H, W) shape, where C is a number of channels, H and W are image height and width. Batch of Tensor Images is a tensor of (B, C, H, W) shape, where B is a number of images in the batch. Deterministic or random transformations applied on the batch of Tensor Images identically transform all the images of the batch.
This means that you can pass a batch of images, and the transform will be applied to the whole batch, as long as it respects the shape. The list of indexes acts on the iloc from the dataframe, which selects either a single index or a list of them, returning the requested subset.
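To illustrate the "identical transform per image in the batch" idea without pulling in torchvision, here is a loose analogy using nested lists in place of tensors (the function name and pixel values are hypothetical):

```python
# A transform that respects the (C, H, W) / (B, C, H, W) convention:
# applied to a batch, it has exactly the same effect on each image
# as it would have on that image alone.

def scale_pixels(x, factor=2):
    """Recursively multiply every pixel by factor."""
    if isinstance(x, list):
        return [scale_pixels(item, factor) for item in x]
    return x * factor

single = [[[1, 2], [3, 4]]]   # one "image", shape (C=1, H=2, W=2)
batch = [single, single]      # a "batch", shape (B=2, C=1, H=2, W=2)

print(scale_pixels(single))   # [[[2, 4], [6, 8]]]
print(scale_pixels(batch))    # the same transform applied to each image
```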