Iterating through a DataLoader in PyTorch using Google Colab

I am trying to use PyTorch to classify a dataset of cat and dog images. So far my code downloads the data and goes into the train folder, which contains two folders called "cats" and "dogs". I then try to load this data into a DataLoader and iterate through the batches, but the iteration step raises an error I don't understand.

Since this is Google Colab, the code includes steps for downloading the data and installing libraries. Any other advice on my code so far would be appreciated as well.

!pip install torch
!pip install torchvision

from __future__ import print_function, division
import os
import torch
import pandas as pd
import numpy as np
# For showing and formatting images
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# For importing datasets into pytorch
import torchvision.datasets as dataset

# For image transforms such as ToTensor
import torchvision.transforms as transforms

# Used for dataloaders
import torch.utils.data as data

# For pretrained resnet34 model
import torchvision.models as models

# For optimisation function
import torch.nn as nn
import torch.optim as optim

!wget http://files.fast.ai/data/dogscats.zip
!unzip dogscats.zip    

batch_size = 256

train_raw = dataset.ImageFolder(PATH+"train", transform=transforms.ToTensor())
train_loader = data.DataLoader(train_raw, batch_size=batch_size, shuffle=True)

for batch_idx, (data, target) in enumerate(train_loader):
  print("Data: ", batch_idx)

The error is raised by the last two lines; the traceback is below:

RuntimeError                              Traceback (most recent call last)
<ipython-input-66-c32dd0c1b880> in <module>()
----> 1 for batch_idx, (data, target) in enumerate(train_loader):
      2   print("Data: ", batch_idx)

/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.pyc in __next__(self)
    257         if self.num_workers == 0:  # same-process loading
    258             indices = next(self.sample_iter)  # may raise StopIteration
--> 259             batch = self.collate_fn([self.dataset[i] for i in indices])
    260             if self.pin_memory:
    261                 batch = pin_memory_batch(batch)

/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.pyc in default_collate(batch)
    133     elif isinstance(batch[0], collections.Sequence):
    134         transposed = zip(*batch)
--> 135         return [default_collate(samples) for samples in transposed]
    137     raise TypeError((error_msg.format(type(batch[0]))))

/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.pyc in default_collate(batch)
    110             storage = batch[0].storage()._new_shared(numel)
    111             out = batch[0].new(storage)
--> 112         return torch.stack(batch, 0, out=out)
    113     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
    114             and elem_type.__name__ != 'string_':

/usr/local/lib/python2.7/dist-packages/torch/functional.pyc in stack(sequence, dim, out)
     62     inputs = [t.unsqueeze(dim) for t in sequence]
     63     if out is None:
---> 64         return torch.cat(inputs, dim)
     65     else:
     66         return torch.cat(inputs, dim, out=out)

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 400 and 487 in dimension 2 at /pytorch/torch/lib/TH/generic/THTensorMath.c:2897


1 Answer

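I think the main problem is that the images are of different sizes. The default collate function builds each batch with torch.stack, which requires every tensor to have the same shape; that is what "Got 400 and 487 in dimension 2" in the error refers to. I may have misunderstood how you are using ImageFolder, but I don't think you need to supply labels for the images: if the directory structure follows the layout PyTorch expects (one subfolder per class), ImageFolder infers the labels for you. I would also add a step to your transform that resizes every image from the folder to the same size, such as: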

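   normalize = transforms.Normalize(
                        mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
   transform = transforms.Compose([
       transforms.Resize((224, 224)),  # assumed size; any fixed size avoids the stacking error
       transforms.ToTensor(),
       normalize])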
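
To see why differently sized images break batching, here is a minimal sketch that reproduces the size mismatch with plain tensors and then fixes it by resizing. It assumes a recent PyTorch (not the Python 2.7 build from the traceback); the image sizes, the 224x224 target, and the helper names can_stack and resize_to are made up for illustration, and F.interpolate stands in for the resize step of a torchvision transform.

```python
import torch
import torch.nn.functional as F

# Two fake RGB "images" with different heights and widths, mimicking the
# 400 vs 487 mismatch from the traceback (the exact sizes are illustrative).
img_a = torch.rand(3, 375, 400)
img_b = torch.rand(3, 499, 487)


def can_stack(tensors):
    """Return True if torch.stack accepts the tensors, False on a size mismatch."""
    try:
        torch.stack(tensors, 0)
        return True
    except RuntimeError:
        return False


def resize_to(img, size=(224, 224)):
    """Resize one CxHxW tensor to a fixed size (224x224 is an assumed target)."""
    # interpolate expects a batch dimension, so add one and remove it again.
    resized = F.interpolate(img.unsqueeze(0), size=size, mode="bilinear",
                            align_corners=False)
    return resized.squeeze(0)


# Stacking the raw images fails, which is what the default collate step hits.
stack_before = can_stack([img_a, img_b])

# After resizing, every tensor has the same shape and stacking succeeds.
batch = torch.stack([resize_to(img_a), resize_to(img_b)], 0)
stack_after_shape = tuple(batch.shape)
```

Here stack_before ends up False and the resized batch has one entry per image. In a real pipeline the resize would normally live in the transform passed to ImageFolder, so the DataLoader only ever sees equally sized tensors.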

You can also speed up your DataLoader by tuning batch_size and setting the number of CPU worker processes (num_workers), for example:

    testloader = DataLoader(testset, batch_size=16,
                         shuffle=False, num_workers=4)

I think this will make your pipeline much faster.
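
As a rough, self-contained illustration of those DataLoader arguments, the sketch below builds a loader over a stand-in dataset. The TensorDataset of random 3x32x32 tensors and the 0/1 labels are assumptions for the demo, not your cats and dogs data; with real images, testset would be an ImageFolder instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 fake 3x32x32 images with binary labels
# (hypothetical data, used only so the example runs without downloads).
images = torch.rand(100, 3, 32, 32)
labels = torch.randint(0, 2, (100,))
testset = TensorDataset(images, labels)

# Same arguments as in the snippet above. Worker processes are started
# lazily, only once iteration over the loader begins.
testloader = DataLoader(testset, batch_size=16, shuffle=False, num_workers=4)

# Number of batches per epoch: ceil(100 / 16), with a smaller final batch.
num_batches = len(testloader)
```

Whether num_workers=4 actually helps depends on how many CPU cores the Colab instance has and how expensive the per-image transforms are, so it is worth timing a few values rather than taking 4 as given.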

