The torchvision package provides easy access to commonly used datasets. You would use them like this:
import torch
import torchvision
import torchvision.transforms as transforms

# Any transform works here; ToTensor() is just a minimal placeholder.
transform = transforms.ToTensor()

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
Apparently, you can only switch between train=True and train=False. The docs explain:

train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.
But this goes against the common practice of having a three-way split. For serious work, I need another DataLoader with a validation set. Also, it would be nice to specify the split proportions myself. The docs don't say what percentage of the dataset is reserved for testing, and maybe I would like to change that.

I assume this is a conscious design decision: everyone working on one of these datasets is supposed to use the same test set, which makes results comparable. But I still need to get a validation set out of the trainloader. Is it possible to split a DataLoader into two separate streams of data?
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between train, validation, and test sets depends on factors such as the use case, the structure of the model, the dimensionality of the data, and so on.
Split the dataset: we can use train_test_split from scikit-learn to first split the original dataset into train and test sets, and then apply the same function to the train set to carve out the validation set. In the function below, the test set size is the fraction of the original data we want to reserve for the test set.
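A minimal sketch of what such a function might look like, assuming scikit-learn is installed; the helper name train_val_test_split, its default proportions, and the seed are illustrative choices, not part of the original answer. It splits indices rather than tensors, so it works with any PyTorch Dataset via Subset.

import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

def train_val_test_split(dataset, test_size=0.1, val_size=0.1, seed=42):
    # Hypothetical helper: split a Dataset's indices into train/val/test
    # and wrap each index list in a Subset.
    indices = list(range(len(dataset)))
    # First carve off the test set from the full index list.
    train_val_idx, test_idx = train_test_split(
        indices, test_size=test_size, random_state=seed)
    # val_size is given relative to the original dataset, so rescale it
    # before splitting the remaining indices into train and validation.
    relative_val_size = val_size / (1.0 - test_size)
    train_idx, val_idx = train_test_split(
        train_val_idx, test_size=relative_val_size, random_state=seed)
    return (Subset(dataset, train_idx),
            Subset(dataset, val_idx),
            Subset(dataset, test_idx))

Each returned Subset can then be wrapped in its own DataLoader.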
The motivation is quite simple: you should separate your data into train, validation, and test splits so that you can detect overfitting and evaluate your model accurately.
Meanwhile, I stumbled upon the function random_split. So you don't split the DataLoader, you split the Dataset:

torch.utils.data.random_split(dataset, lengths)
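Applied to the CIFAR10 training set from the snippet above, that might look like the sketch below. The 45,000/5,000 split and the seed are arbitrary choices of mine, and the generator argument assumes a reasonably recent PyTorch version.

# Split the 50,000 CIFAR10 training images into 45,000 train / 5,000 val.
# The generator is only there to make the split reproducible.
train_subset, val_subset = torch.utils.data.random_split(
    trainset, [45000, 5000],
    generator=torch.Generator().manual_seed(42))

trainloader = torch.utils.data.DataLoader(train_subset, batch_size=4,
                                          shuffle=True, num_workers=2)
valloader = torch.utils.data.DataLoader(val_subset, batch_size=4,
                                        shuffle=False, num_workers=2)

The official test set stays untouched, so results remain comparable with other work on the same dataset.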