Train-Valid-Test split for custom dataset using PyTorch and TorchVision

Tags: python, pytorch, torchvision

I have some image data for a binary classification task and the images are organised into 2 folders as data/model_data/class-A and data/model_data/class-B.

There are a total of N images. I want to have a 70/20/10 split for train/val/test. I am using PyTorch and Torchvision for the task. Here is the code I have so far.

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils, datasets, models

data_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

model_dataset = datasets.ImageFolder(root, transform=data_transform)
total_count = len(model_dataset)  # total number of images (N)
train_count = int(0.7 * total_count)
valid_count = int(0.2 * total_count)
test_count = total_count - train_count - valid_count
train_dataset, valid_dataset, test_dataset = torch.utils.data.random_split(model_dataset, (train_count, valid_count, test_count))
train_dataset_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKER)  
valid_dataset_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKER) 
test_dataset_loader  = torch.utils.data.DataLoader(test_dataset , batch_size=BATCH_SIZE, shuffle=False,num_workers=NUM_WORKER)
dataloaders = {'train': train_dataset_loader, 'val': valid_dataset_loader, 'test': test_dataset_loader}

I feel that this isn't the correct way to do this, for two reasons:

1. I am applying the same transform to all the splits. (This is not what I want to do, obviously! The solution for this is most probably the answer here.)
2. Usually people first separate the original data into test/train and then they separate train into train/val, whereas I am directly separating the original data into train/val/test. (Is this correct?)

So, my question is, is what I am doing correct? (Probably not)
And if it is not correct, how do I go about writing the data loaders to achieve the required splits, so that I can apply separate transforms to each of train/test/val?

403

asked May 15 '20 04:05

iamshnoo

1 Answer

> Usually people first separate the original data into test/train and then they separate train into train/val, whereas I am directly separating the original data into train/val/test. (Is this correct?)

Yes, it's correct, readable, and perfectly fine all in all.
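To illustrate why the direct split is fine, here is a minimal sketch comparing it with the two-stage variant. The 100-sample `TensorDataset` is a stand-in (an assumption, not the asker's data) for the `ImageFolder` dataset, and the seed value is arbitrary. Both approaches yield disjoint subsets of the same sizes.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 100 samples (hypothetical; use your ImageFolder instead).
dataset = TensorDataset(torch.arange(100))

total_count = len(dataset)
train_count = int(0.7 * total_count)
valid_count = int(0.2 * total_count)
test_count = total_count - train_count - valid_count

# Direct three-way split, seeded so the partition is reproducible.
generator = torch.Generator().manual_seed(0)
train_set, valid_set, test_set = random_split(
    dataset, (train_count, valid_count, test_count), generator=generator
)

# Two-stage alternative: hold out the test set first,
# then carve the validation set out of the remainder.
generator = torch.Generator().manual_seed(0)
rest_set, test_set_2 = random_split(
    dataset, (total_count - test_count, test_count), generator=generator
)
train_set_2, valid_set_2 = random_split(
    rest_set, (train_count, valid_count), generator=generator
)

split_sizes = (len(train_set), len(valid_set), len(test_set))
two_stage_sizes = (len(train_set_2), len(valid_set_2), len(test_set_2))
print(split_sizes, two_stage_sizes)
```

Passing a seeded `generator` is optional, but it keeps the train/val/test membership identical across runs, which matters if you evaluate a saved model later.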

> I am applying the same transform to all the splits. (This is not what I want to do, obviously! The solution for this is most probably the answer here.)

Yes, that answer is a possibility, but it's pointlessly verbose, to be honest. You can use the third-party tool torchdata, installable with:

pip install torchdata

Documentation can be found here (also disclaimer: I'm the author).

It allows you to map your transformations onto any torch.utils.data.Dataset easily (in this case, onto the train split only). Your code would look like this (only a few lines have to change, check the comments; I also reformatted your code to make it easier to follow):

import torch
import torchvision

import torchdata as td

data_transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomResizedCrop(224),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        ),
    ]
)

# Single change, makes an instance of torchdata.Dataset
# Works just like PyTorch's torch.utils.data.Dataset, but has
# additional capabilities like .map, cache etc., see project's description
model_dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder(root))
# Also you shouldn't use transforms here but below
total_count = len(model_dataset)
train_count = int(0.7 * total_count)
valid_count = int(0.2 * total_count)
test_count = total_count - train_count - valid_count
train_dataset, valid_dataset, test_dataset = torch.utils.data.random_split(
    model_dataset, (train_count, valid_count, test_count)
)

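# Apply transformations here, only for the train dataset.
# random_split returns plain torch.utils.data.Subset objects, which have no
# .map method, so the subset is wrapped again first. ImageFolder yields
# (image, label) pairs, so only the image is transformed.
def transform_train_sample(sample):
    image, label = sample
    return data_transform(image), label

train_dataset = td.datasets.WrapDataset(train_dataset).map(transform_train_sample)

# Note: valid_dataset and test_dataset still yield PIL images at this point.
# Map a deterministic transform (e.g. Resize, CenterCrop, ToTensor, Normalize)
# onto them in the same way, otherwise the DataLoader cannot batch them.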

# Rest of the code goes the same

train_dataset_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKER
)
valid_dataset_loader = torch.utils.data.DataLoader(
    valid_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKER
)
test_dataset_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKER
)
dataloaders = {
    "train": train_dataset_loader,
    "val": valid_dataset_loader,
    "test": test_dataset_loader,
}

And yeah, I agree that specifying transform before splitting isn't too clear and IMO this is way more readable.
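If you would rather not add a dependency, the same idea can be sketched in plain PyTorch: split first, then wrap each subset in a small dataset class that applies its own transform. `TransformedSubset`, `train_transform` and `eval_transform` below are hypothetical names introduced for illustration, and the tensor dataset and arithmetic transforms are stand-ins for `ImageFolder(root)` and `torchvision.transforms.Compose` pipelines (random augmentation for train, deterministic resize/crop for val and test).

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset, random_split


class TransformedSubset(Dataset):
    """Hypothetical helper: applies a transform to the input of each (input, label) sample."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, index):
        x, y = self.subset[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, y


def train_transform(x):
    # Stand-in for the augmenting pipeline (RandomResizedCrop, flip, ToTensor, Normalize).
    return x * 2.0


def eval_transform(x):
    # Stand-in for a deterministic pipeline (Resize, CenterCrop, ToTensor, Normalize).
    return x - 1.0


# Stand-in for ImageFolder(root) created WITHOUT a transform.
base = TensorDataset(torch.ones(10, 3), torch.zeros(10, dtype=torch.long))
train_subset, valid_subset, test_subset = random_split(
    base, (7, 2, 1), generator=torch.Generator().manual_seed(0)
)

# Each split gets its own transform after the split has been made.
train_dataset = TransformedSubset(train_subset, train_transform)
valid_dataset = TransformedSubset(valid_subset, eval_transform)
test_dataset = TransformedSubset(test_subset, eval_transform)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=4, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)
```

Because the transform lives on the wrapper rather than on the underlying dataset, the three splits share one copy of the data while still being preprocessed differently.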

188

answered Sep 18 '22 13:09

Szymon Maszke
