The only difference is that one of the parameters passed to DataLoader is of type numpy.ndarray and the other is of type list, yet the DataLoader gives totally different results.
You can use the following code to reproduce it:
from torch.utils.data import DataLoader, Dataset
import numpy as np

class my_dataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

train_data = [[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]]
train_label = [-1, -2, -11, -12]

########################### Look at here:
test = DataLoader(dataset=my_dataset(np.array(train_data), train_label), batch_size=2)
for i in test:
    print("numpy data:")
    print(i)
    break

test = DataLoader(dataset=my_dataset(train_data, train_label), batch_size=2)
for i in test:
    print("list data:")
    print(i)
    break
The result is:
numpy data:
[tensor([[1, 2, 3],
[5, 6, 7]]), tensor([-1, -2])]
list data:
[[tensor([1, 5]), tensor([2, 6]), tensor([3, 7])], tensor([-1, -2])]
PyTorch tensors are similar to NumPy arrays, but can also be operated on a CUDA-capable Nvidia GPU. NumPy arrays are mainly used in classical machine learning algorithms (such as k-means or decision trees in scikit-learn), whereas PyTorch tensors are mainly used in deep learning, which requires heavy matrix computation.
The most important difference between the two frameworks is naming: NumPy calls tensors (high-dimensional matrices or vectors) arrays, while in PyTorch they're just called tensors.
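As a minimal sketch of how closely the two interoperate: torch.from_numpy gives you a tensor view of an existing ndarray, and .numpy() goes the other way; for CPU tensors both share the same underlying memory.

import numpy as np
import torch

a = np.array([[1, 2, 3], [5, 6, 7]])
t = torch.from_numpy(a)   # zero-copy: the tensor shares memory with the array
print(t)                  # tensor([[1, 2, 3], [5, 6, 7]])
print(t.numpy())          # back to an ndarray, still sharing the same memory
a[0, 0] = 99
print(t[0, 0])            # tensor(99) -- the change is visible through the tensor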
Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
Tensors on CPU and GPU: a GPU (graphics processing unit) is composed of hundreds of simpler cores, which makes training deep learning models much faster. For simple matrix multiplication, a GPU can be roughly 15 times faster than NumPy on a CPU.
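A rough sketch of such a comparison (the actual speedup depends heavily on your hardware, matrix size, and BLAS backend; the GPU branch only runs if CUDA is available):

import time
import numpy as np
import torch

n = 4096
a = np.random.rand(n, n).astype(np.float32)

start = time.time()
a @ a                                  # NumPy matmul on the CPU
print("numpy:", time.time() - start)

if torch.cuda.is_available():          # only time the GPU path if CUDA is present
    t = torch.from_numpy(a).cuda()
    torch.cuda.synchronize()           # make sure timing excludes the transfer
    start = time.time()
    t @ t
    torch.cuda.synchronize()           # CUDA ops are async; wait before stopping the clock
    print("torch (GPU):", time.time() - start)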
This is because of how batching is handled in torch.utils.data.DataLoader. The collate_fn argument decides how individual samples are merged into a single batch. The default for this argument is the undocumented torch.utils.data.default_collate.
This function handles batching by assuming that numbers/tensors/ndarrays are primitive data to batch, and that lists/tuples/dicts containing these primitives are structure to be (recursively) preserved. This allows you to have semantic batching like this:
(input_tensor, label_tensor) -> (batched_input_tensor, batched_label_tensor)
([input_tensor_1, input_tensor_2], label_tensor) -> ([batched_input_tensor_1, batched_input_tensor_2], batched_label_tensor)
{'input': input_tensor, 'target': target_tensor} -> {'input': batched_input_tensor, 'target': batched_target_tensor}
(The left side of -> is the output of dataset[i], while the right side is the batched sample from torch.utils.data.DataLoader.)
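You can see this mapping by calling default_collate directly on a list of samples. A small sketch, reusing the sample shapes from your question (in recent PyTorch versions the function is importable as below; in older versions it lives in torch.utils.data.dataloader):

import numpy as np
from torch.utils.data import default_collate

# When self.data is an ndarray, each sample is (ndarray, int):
# both parts are primitives, so both are batched into tensors.
batch = default_collate([(np.array([1, 2, 3]), -1), (np.array([5, 6, 7]), -2)])
print(batch)  # [tensor([[1, 2, 3], [5, 6, 7]]), tensor([-1, -2])]

# When self.data is a list of lists, each sample is ([int, int, int], int):
# the list structure is preserved and the ints at each position are batched.
batch = default_collate([([1, 2, 3], -1), ([5, 6, 7], -2)])
print(batch)  # [[tensor([1, 5]), tensor([2, 6]), tensor([3, 7])], tensor([-1, -2])]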
Your example code is similar to example 2 above: the list structure is preserved while the ints are batched.
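If you want the list version to batch the same way as the numpy version, one option is to pass a custom collate_fn that builds the tensors yourself. A sketch reusing my_dataset, train_data, and train_label from your question (the helper name list_collate is mine):

import torch
from torch.utils.data import DataLoader

def list_collate(samples):
    # samples is a list of (data, label) pairs as returned by __getitem__
    data = torch.tensor([d for d, _ in samples])
    labels = torch.tensor([l for _, l in samples])
    return data, labels

test = DataLoader(dataset=my_dataset(train_data, train_label),
                  batch_size=2, collate_fn=list_collate)
for data, labels in test:
    print(data)    # tensor([[1, 2, 3], [5, 6, 7]])
    print(labels)  # tensor([-1, -2])
    break

Alternatively, simply convert the list to an ndarray or tensor before constructing the dataset, as in the numpy version of your code.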