Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PyTorch: Testing with torchvision.datasets.ImageFolder and DataLoader

Tags:

python

pytorch

I'm a newbie trying to make this PyTorch CNN work with the Cats&Dogs dataset from kaggle. As there are no targets for the test images, I manually classified some of the test images and put the class in the filename, to be able to test (maybe should have just used some of the train images).

I used the torchvision.datasets.ImageFolder class to load the train and test images. The training seems to work.

But what do I need to do to make the test-routine work? I don't know, how to connect my test_data_loader with the test loop at the bottom, via test_x and test_y.

The Code is based on this MNIST example CNN. There, something like this is used right after the loaders are created. But I failed to rewrite it for my dataset:

test_x = Variable(torch.unsqueeze(test_data.test_data, dim=1), volatile=True).type(torch.FloatTensor)[:2000]/255.   # shape from (2000, 28, 28) to (2000, 1, 28, 28), value in range(0,1)
test_y = test_data.test_labels[:2000]

The Code:

import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.utils.data as data
import torchvision
from torchvision import transforms

EPOCHS = 2
BATCH_SIZE = 10
LEARNING_RATE = 0.003
TRAIN_DATA_PATH = "./train_cl/"
TEST_DATA_PATH = "./test_named_cl/"
TRANSFORM_IMG = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225] )
    ])

train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM_IMG)
train_data_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,  num_workers=4)
test_data = torchvision.datasets.ImageFolder(root=TEST_DATA_PATH, transform=TRANSFORM_IMG)
test_data_loader  = data.DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=4) 

class CNN(nn.Module):
    # omitted...

if __name__ == '__main__':

    print("Number of train samples: ", len(train_data))
    print("Number of test samples: ", len(test_data))
    print("Detected Classes are: ", train_data.class_to_idx) # classes are detected by folder structure

    model = CNN()    
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    loss_func = nn.CrossEntropyLoss()    

    # Training and Testing
    for epoch in range(EPOCHS):        
        for step, (x, y) in enumerate(train_data_loader):
            b_x = Variable(x)   # batch x (image)
            b_y = Variable(y)   # batch y (target)
            output = model(b_x)[0]          
            loss = loss_func(output, b_y)   
            optimizer.zero_grad()           
            loss.backward()                 
            optimizer.step()

            # Test -> this is where I have no clue
            if step % 50 == 0:
                test_x = Variable(test_data_loader)
                test_output, last_layer = model(test_x)
                pred_y = torch.max(test_output, 1)[1].data.squeeze()
                accuracy = sum(pred_y == test_y) / float(test_y.size(0))
                print('Epoch: ', epoch, '| train loss: %.4f' % loss.data[0], '| test accuracy: %.2f' % accuracy)
like image 504
kett Avatar asked Mar 02 '18 16:03

kett


People also ask

What does ImageFolder do in PyTorch?

ImageFolder class: Responsible for loading images from train and val folders into a PyTorch dataset. DataLoader class: Enables us to wrap an iterable around our dataset so that data samples can be efficiently accessed by our deep learning model.

What is the difference between a PyTorch dataset and a PyTorch dataloader?

Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

Why do we use dataloader in PyTorch?

Creating a PyTorch Dataset and managing it with Dataloader keeps your data manageable and helps to simplify your machine learning pipeline. a Dataset stores all your data, and Dataloader is can be used to iterate through the data, manage batches, transform the data, and much more.


1 Answers

Looking at the data from Kaggle and your code, it seems that there are problems in your data loading, both train and test set. First of all, the data should be in a different folder per label for the default PyTorch ImageFolder to load it correctly. In your case, since all the training data is in the same folder, PyTorch is loading it as one class and hence learning seems to be working. You can correct this by using a folder structure like - train/dog, - train/cat, - test/dog, - test/cat and then passing the train and the test folder to the train and test ImageFolder respectively. The training code seems fine, just change the folder structure and you should be good. Take a look at the official documentation of ImageFolder which has a similar example.

like image 111
Monster Avatar answered Sep 16 '22 17:09

Monster