
How to get entire dataset from dataloader in PyTorch

How to load entire dataset from the DataLoader? I am getting only one batch of dataset.

This is my code

import torch

dataloader = torch.utils.data.DataLoader(dataset=dataset, batch_size=64)
images, labels = next(iter(dataloader))
Aakanksha W.S asked Aug 07 '19 04:08

People also ask

What does DataLoader in PyTorch return?

DataLoader in your case is supposed to return a pair: (inputs batch, labels batch). For example, with batch_size=64, the 64 labels correspond to the 64 images in the batch.


What does Num_workers do in PyTorch?

num_workers denotes the number of processes that generate batches in parallel. A high enough number of workers ensures that CPU-side data loading keeps up, i.e. that the bottleneck is the neural network's forward and backward operations on the GPU, not batch generation.


3 Answers

You can set batch_size=len(dataset); if dataset is a torch Dataset, batch_size=dataset.__len__() is equivalent.

Beware, this might require a lot of memory depending upon your dataset.
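As a minimal sketch of this answer (the toy TensorDataset below is an assumption standing in for the asker's actual dataset), a single-batch dataloader looks like this:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy stand-in for the asker's dataset: 100 fake images with labels.
images = torch.randn(100, 3, 8, 8)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# One batch spanning the whole dataset.
dataloader = DataLoader(dataset, batch_size=len(dataset))
all_images, all_labels = next(iter(dataloader))
print(all_images.shape)  # torch.Size([100, 3, 8, 8])
```

The entire dataset is materialized as one tensor, which is where the memory warning above comes from.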

asymptote answered Oct 16 '22 23:10


I'm not sure whether you want to use the dataset somewhere other than network training (to inspect the images, for example) or want to iterate over the batches during training.

Iterating through the dataset

Either follow Usman Ali's answer (which might overflow your memory) or you could do

for i in range(len(dataset)): # or i, image in enumerate(dataset)
    images, labels = dataset[i] # or whatever your dataset returns

You are able to write dataset[i] because you implemented __len__ and __getitem__ in your Dataset class (as long as it's a subclass of the PyTorch Dataset class).
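To make the indexing above concrete, here is a minimal hypothetical Dataset subclass (the class name and the (x, x²) items are invented for illustration) showing the two methods that direct indexing relies on:

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical toy dataset returning (x, x**2) pairs."""
    def __init__(self, n):
        self.n = n

    def __len__(self):          # enables len(dataset)
        return self.n

    def __getitem__(self, i):   # enables dataset[i]
        x = torch.tensor(float(i))
        return x, x ** 2

dataset = SquaresDataset(5)
for i in range(len(dataset)):
    x, y = dataset[i]  # iterate items directly, no DataLoader involved
```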

Getting all batches from the dataloader

The way I understand your question is that you want to retrieve all batches to train the network with. You should understand that iter gives you an iterator over the dataloader (if you're not familiar with the concept of iterators, see the Wikipedia entry). next tells the iterator to give you the next item.

So, in contrast to an iterator traversing a list, which stops at some point, next(iter(dataloader)) always has a next item to return. I assume that you have something like a number of epochs and a number of steps per epoch. Then your code would look like this

for i in range(epochs):
    # some code
    for j in range(steps_per_epoch):
        images, labels = next(iter(dataloader))
        prediction = net(images)
        loss = net.loss(prediction, labels)
        ...

Be careful with next(iter(dataloader)): each call to iter(dataloader) creates a brand-new iterator that starts at the first batch again, so the inner loop keeps re-reading the beginning of the dataset instead of advancing through it. To avoid this, take the iterator out to the top, like so:

iterator = iter(dataloader)
for i in range(epochs):
    for j in range(steps_per_epoch):
        images, labels = next(iterator)
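One caveat with the hoisted iterator: if steps_per_epoch times epochs exceeds the number of batches one pass provides, next(iterator) raises StopIteration. A small runnable sketch (the toy dataset and step counts are assumptions, not from the original answer) of restarting the iterator when a pass is exhausted:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy dataset: 10 samples, so one pass yields batches of 4, 4, 2.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))
dataloader = DataLoader(dataset, batch_size=4)

iterator = iter(dataloader)
seen = 0
for step in range(6):            # deliberately more steps than one pass has
    try:
        xb, yb = next(iterator)
    except StopIteration:        # pass exhausted: start a fresh one
        iterator = iter(dataloader)
        xb, yb = next(iterator)
    seen += len(yb)
```

After two full passes over the 10 samples, seen counts 20 samples.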
Florian Blume answered Oct 16 '22 21:10


Another option would be to get the entire dataset directly, without using the dataloader, provided your dataset supports slice indexing (TensorDataset does, for example), like so:

images, labels = dataset[:]
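For datasets whose __getitem__ does not accept a slice, a sketch of an alternative (an assumption beyond this answer, using only standard torch calls) is to drain the dataloader and concatenate the batches:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(torch.randn(10, 2), torch.arange(10))
dataloader = DataLoader(dataset, batch_size=4)  # shuffle=False keeps order

# Collect every batch, then stitch them back into full tensors.
image_batches, label_batches = zip(*[(x, y) for x, y in dataloader])
all_images = torch.cat(image_batches)
all_labels = torch.cat(label_batches)
```

With shuffle left off, the concatenated tensors preserve the dataset's original order.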
Jean B. answered Oct 16 '22 23:10