When does the DataLoader shuffle happen in PyTorch?

I have been using the shuffle option of the PyTorch DataLoader many times. But I was wondering when this shuffle happens and whether it is performed dynamically during iteration. Take the following code as an example:

from torch.utils.data import DataLoader

namesDataset = NamesDataset()  # custom Dataset defined elsewhere
namesTrainLoader = DataLoader(namesDataset, batch_size=16, shuffle=True)
for batch_data in namesTrainLoader:
    print(batch_data)

When we define "namesTrainLoader", does that mean the shuffling is finished and the following iteration will be based on a fixed order of data? Will there be any randomness in the for loop after "namesTrainLoader" is defined?

I was trying to replace half of "batch_data" with some special value:

for batch_data in namesTrainLoader:
    batch_data[:8] = special_val
    pre = model(batch_data)

Let's say there is an infinite number of epochs: will "model" eventually see all the data in "namesTrainLoader"? Or is half of the data in "namesTrainLoader" actually lost to "model"?

Jim Wang asked May 10 '20

People also ask

Does PyTorch DataLoader shuffle every epoch?

Yes. With shuffle=True the DataLoader reshuffles at every epoch, because a new random permutation of the indices is drawn each time an iterator over the loader is created.
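A quick way to see this (a minimal sketch using a toy TensorDataset, so the batch contents directly reveal the iteration order):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Each sample is just its own index, so printing the batches shows the order.
dataset = TensorDataset(torch.arange(10))
loader = DataLoader(dataset, batch_size=5, shuffle=True)

for epoch in range(2):
    order = [x.item() for (batch,) in loader for x in batch]
    print(f"epoch {epoch}: {order}")  # a different permutation each epoch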

Does DataLoader shuffle data?

A PyTorch DataLoader takes your raw dataset and automatically slices it up into mini-batches. In addition, if your dataset has long runs of identical labels, you can use the shuffle option to have the samples automatically shuffled as they are fed into the training loop.
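For example, a minimal sketch (with made-up features and labels) showing how 10 samples are sliced into mini-batches of 4:

import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(10, 3)          # 10 samples, 3 features each
labels = torch.randint(0, 2, (10,))    # 10 binary labels
loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

for batch_features, batch_labels in loader:
    # Two batches of 4 and a final batch of 2 (unless drop_last=True is set)
    print(batch_features.shape, batch_labels.shape)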

How does PyTorch shuffle work?

With a buffered shuffle (e.g. a shuffle buffer of 2048 items), the dataloader first reads items from the source tensors until the shuffle buffer is full. Then the indices available in the shuffle buffer are randomly sampled to construct the batches returned by the dataloader. (A map-style DataLoader with shuffle=True instead permutes all indices up front, as described in the answer below.)
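The buffered approach can be sketched in plain Python (an illustration of the idea only, not PyTorch's actual implementation; buffer_size stands in for the buffer capacity):

import random

def buffered_shuffle(source, buffer_size=2048):
    """Yield items from source in approximately random order using a fixed-size buffer."""
    buffer = []
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one randomly chosen element, then keep filling.
            yield buffer.pop(random.randrange(len(buffer)))
    # Drain the remaining items in random order.
    random.shuffle(buffer)
    yield from buffer

# Items are only randomized within a sliding window of roughly buffer_size elements.
print(list(buffered_shuffle(range(10), buffer_size=4)))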

Should I shuffle validation DataLoader?

It is not necessary: shuffling the validation/test data has no impact on the accuracy, loss, etc. Shuffling is done during training to make sure the model is not exposed to the same cycle (order) of data in every epoch.
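A typical setup therefore shuffles only the training split (a sketch with a synthetic dataset and random_split; the sizes are arbitrary):

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

full_set = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
train_set, val_set = random_split(full_set, [80, 20])

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)   # reshuffled every epoch
val_loader = DataLoader(val_set, batch_size=16, shuffle=False)      # fixed order is fine for evaluation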


1 Answer

The shuffling happens when the iterator is created. In the case of the for loop, that happens just before the for loop starts.

You can create the iterator manually with:

# Iterator gets created, the data has been shuffled at this point.
data_iterator = iter(namesTrainLoader)

By default the data loader uses torch.utils.data.RandomSampler if you set shuffle=True (without providing your own sampler). Its implementation is very straightforward, and you can see where the data is shuffled when the iterator is created by looking at the RandomSampler.__iter__ method:

def __iter__(self):
    n = len(self.data_source)
    if self.replacement:
        # Sampling with replacement: draw num_samples random indices.
        return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
    # Sampling without replacement: a fresh random permutation of all indices.
    return iter(torch.randperm(n).tolist())

The return statement is the important part, where the shuffling takes place. It simply creates a random permutation of the indices.
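For example, torch.randperm(n) returns the indices 0 to n-1 in a random order, and those indices decide which samples are drawn and in what order:

import torch

print(torch.randperm(5))  # e.g. tensor([3, 0, 4, 1, 2]) -- a new permutation on every call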

That means you will see your entire dataset every time you fully consume the iterator, just in a different order each time. Therefore no data is lost (excluding cases where drop_last=True drops a final incomplete batch), and your model will see all the data at every epoch.
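You can verify this yourself with a toy dataset (a sketch; every epoch yields the full set of indices, only their order changes):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for epoch in range(3):
    seen = torch.cat([batch for (batch,) in loader])
    # Every index appears exactly once per epoch, just in a different order.
    assert sorted(seen.tolist()) == list(range(100))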

Michael Jungo answered Sep 19 '22