How do I load custom image based datasets into Pytorch for use with a CNN?

I have searched for hours on the internet to find a good solution to my issue. Here is some relevant background information to help you answer my question.

This is my first ever deep learning project and I have no idea what I am doing. I know the theory but not the practical elements.

The data that I am using can be found on kaggle at this link: (https://www.kaggle.com/alxmamaev/flowers-recognition)

I am aiming to classify flowers based on the images provided in the dataset using a CNN.

Here is some sample code I have tried to use to load the data so far. This is my best attempt, but as I mentioned, I am clueless and the PyTorch docs didn't offer much help that I could understand at my level. (https://pastebin.com/fNLVW1UW)

```
# Loads the images for use with the CNN.
def load_images(image_size=32, batch_size=64, root="../images"):
    transform = transforms.Compose([
        transforms.Resize(32),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    train_set = datasets.ImageFolder(root=root, train=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)

    return train_loader


# Defining variables for use with the CNN.
classes = ('daisy', 'dandelion', 'rose', 'sunflower', 'tulip')
train_loader_data = load_images()

# Training samples.
n_training_samples = 3394
train_sampler = SubsetRandomSampler(np.arange(n_training_samples, dtype=np.int64))

# Validation samples.
n_val_samples = 424
val_sampler = SubsetRandomSampler(np.arange(n_training_samples, n_training_samples + n_val_samples, dtype=np.int64))

# Test samples.
n_test_samples = 424
test_sampler = SubsetRandomSampler(np.arange(n_test_samples, dtype=np.int64))
```

Here are my direct questions that I need answered:

  • How do I fix my code to load the dataset in an 80/10/10 split for training/test/validation?

  • How do I create the required labels/classes for these images, which are already divided by folders in /images?

asked Jul 29 '18 by Aeryes



2 Answers

Looking at the data from Kaggle and your code, there are problems in your data loading.

The data should be in a different folder per class label for PyTorch's ImageFolder to load it correctly. In your case, since all the training data is in a single folder, PyTorch is loading it all as one train set. You can correct this by using a folder structure like train/daisy, train/dandelion, test/daisy, test/dandelion and then passing the train and test folders to the train and test ImageFolder instances respectively. Just change the folder structure and you should be good. Take a look at the official documentation of torchvision.datasets.ImageFolder, which has a similar example.


As you said, these images are already divided by folders in /images. PyTorch's ImageFolder assumes that images are organized in the following way, but this folder structure is only correct if you are using all of the images as the train set:

```
/images/daisy/100080576_f52e8ee070_n.jpg
/images/daisy/10140303196_b88d3d6cec.jpg
.
.
.
/images/dandelion/10043234166_e6dd915111_n.jpg
/images/dandelion/10200780773_c6051a7d71_n.jpg
```

where 'daisy', 'dandelion' etc. are class labels.
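
In other words, ImageFolder derives the class labels from the sub-folder names, so your second question needs no extra work. A minimal sketch (using the ../images root from your own code):

```
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# ImageFolder maps each sub-folder name to an integer label automatically.
transform = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor()])
dataset = datasets.ImageFolder(root="../images", transform=transform)

print(dataset.classes)       # ['daisy', 'dandelion', 'rose', 'sunflower', 'tulip']
print(dataset.class_to_idx)  # {'daisy': 0, 'dandelion': 1, ...}
image, label = dataset[0]    # label is the integer index of that image's folder
```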

The correct folder structure, if you want to split the dataset into a train and a test set in your case, looks like this (I know you want to split the dataset into train, validation, and test sets, but that doesn't matter, as this is just an example to get the idea across):

```
/images/train/daisy/100080576_f52e8ee070_n.jpg
/images/train/daisy/10140303196_b88d3d6cec.jpg
.
.
/images/train/dandelion/10043234166_e6dd915111_n.jpg
/images/train/dandelion/10200780773_c6051a7d71_n.jpg
.
.
/images/test/daisy/300080576_f52e8ee070_n.jpg
/images/test/daisy/95140303196_b88d3d6cec.jpg
.
.
/images/test/dandelion/32143234166_e6dd915111_n.jpg
/images/test/dandelion/65200780773_c6051a7d71_n.jpg
```
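
To get the 80/10/10 split you asked about, one option is a small one-off script that copies a random share of each class's images into split folders. This is only an illustrative sketch, not part of the original answer: the copy-based approach, the "./images" source path, the seed, and the exact ratios are my own choices.

```
import os
import random
import shutil

random.seed(43)
src_root = "./images"                      # flat layout: images/<class>/<file>.jpg
splits = {"train": 0.8, "val": 0.1, "test": 0.1}

for cls in os.listdir(src_root):
    cls_dir = os.path.join(src_root, cls)
    if not os.path.isdir(cls_dir) or cls in splits:
        continue                           # skip plain files and already-created split folders
    files = os.listdir(cls_dir)
    random.shuffle(files)
    n_train = int(len(files) * splits["train"])
    n_val = int(len(files) * splits["val"])
    buckets = {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
    for split, names in buckets.items():
        dst_dir = os.path.join(src_root, split, cls)
        os.makedirs(dst_dir, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(cls_dir, name), os.path.join(dst_dir, name))
```

After running it once, you can point three separate ImageFolder instances at ./images/train, ./images/val, and ./images/test.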

Then you can refer to the following full code example, which shows how to write a data loader:

```
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.utils.data as data
import torchvision
from torchvision import transforms

EPOCHS = 2
BATCH_SIZE = 10
LEARNING_RATE = 0.003
TRAIN_DATA_PATH = "./images/train/"
TEST_DATA_PATH = "./images/test/"
TRANSFORM_IMG = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225] )
    ])

train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM_IMG)
train_data_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,  num_workers=4)
test_data = torchvision.datasets.ImageFolder(root=TEST_DATA_PATH, transform=TRANSFORM_IMG)
test_data_loader  = data.DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=4) 

class CNN(nn.Module):
    # omitted...
    pass

if __name__ == '__main__':

    print("Number of train samples: ", len(train_data))
    print("Number of test samples: ", len(test_data))
    print("Detected Classes are: ", train_data.class_to_idx) # classes are detected by folder structure

    model = CNN()    
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    loss_func = nn.CrossEntropyLoss()    

    # Training and Testing
    for epoch in range(EPOCHS):        
        for step, (x, y) in enumerate(train_data_loader):
            b_x = Variable(x)   # batch x (image)
            b_y = Variable(y)   # batch y (target)
            output = model(b_x)[0]          
            loss = loss_func(output, b_y)   
            optimizer.zero_grad()           
            loss.backward()                 
            optimizer.step()

            if step % 50 == 0:
                # Evaluate on the whole test set.
                model.eval()
                correct, total = 0, 0
                with torch.no_grad():
                    for test_x, test_y in test_data_loader:
                        test_output = model(test_x)[0]
                        pred_y = torch.max(test_output, 1)[1]
                        correct += (pred_y == test_y).sum().item()
                        total += test_y.size(0)
                accuracy = correct / total
                model.train()
                print('Epoch: ', epoch, '| train loss: %.4f' % loss.item(),
                      '| test accuracy: %.2f' % accuracy)
```
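
If you would rather not maintain a separate validation folder on disk, another option (not something the answer above uses) is to carve the validation set out of train_data in memory with torch.utils.data.random_split. A sketch reusing train_data and BATCH_SIZE from the example above; the 90/10 ratio and seed are just examples:

```
import torch
from torch.utils.data import DataLoader, random_split

# Hold out 10% of the training ImageFolder as a validation set.
n_val = int(0.1 * len(train_data))
n_train = len(train_data) - n_val
train_subset, val_subset = random_split(
    train_data, [n_train, n_val], generator=torch.Generator().manual_seed(43))

train_data_loader = DataLoader(train_subset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
val_data_loader = DataLoader(val_subset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
```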
answered Oct 28 '22 by cedrickchee


There is now an easy package for the splitting, called split-folders. For example:

```
import splitfolders
splitfolders.ratio(image_path, output="output", seed=43, ratio=(.8, .1, .1))
```
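
The output folder should then contain train/, val/, and test/ sub-folders, each with one folder per class, so they can be passed straight to ImageFolder as in the first answer. A short sketch assuming that layout (the transform here is just a placeholder):

```
import torchvision
from torchvision import transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

train_data = torchvision.datasets.ImageFolder("output/train", transform=transform)
val_data = torchvision.datasets.ImageFolder("output/val", transform=transform)
test_data = torchvision.datasets.ImageFolder("output/test", transform=transform)
```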
answered Oct 28 '22 by BDeforce