Is it possible to split the training DataLoader (and dataset) into training and validation datasets?

The torchvision package provides easy access to commonly used datasets. You would use them like this:

import torch
import torchvision
import torchvision.transforms as transforms

# Assumption: a minimal transform, since the original snippet leaves it undefined.
transform = transforms.ToTensor()

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

Apparently, you can only switch between train=True and train=False. The docs explain:

train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.

But this goes against the common practice of having a three-way split. For serious work, I need another DataLoader with a validation set. It would also be nice to specify the split proportions myself. The docs don't say what percentage of the dataset is reserved for testing, and I might want to change that.

I assume that this is a conscious design decision: everyone working on one of these datasets is supposed to use the same test set, which makes results comparable. But I still need to get a validation set out of the trainloader. Is it possible to split a DataLoader into two separate streams of data?

asked Nov 20 '18 08:11 by lhk


1 Answer

In the meantime, I found the function random_split. You don't split the DataLoader; you split the Dataset and then build a DataLoader for each part:

torch.utils.data.random_split(dataset, lengths)
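
As a rough sketch of how that call might be used to get a training and a validation loader: the snippet below uses a synthetic TensorDataset as a stand-in for the CIFAR10 training set so that it runs without a download. The 80/20 proportion, the seed, and the names full_train, train_subset, val_subset and valloader are assumptions for illustration, not part of the original post.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in for a torchvision dataset (assumption: 100 samples of
# CIFAR-10-shaped images); any map-style Dataset is split the same way.
full_train = TensorDataset(torch.randn(100, 3, 32, 32),
                           torch.randint(0, 10, (100,)))

# Hypothetical 80/20 split; the lengths must sum to len(full_train).
n_val = len(full_train) // 5
n_train = len(full_train) - n_val

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = random_split(full_train, [n_train, n_val],
                                        generator=generator)

# One DataLoader per subset; only the training loader is shuffled.
trainloader = DataLoader(train_subset, batch_size=4, shuffle=True)
valloader = DataLoader(val_subset, batch_size=4, shuffle=False)
```

Both subsets are views onto the same underlying dataset, so they share its transform. If the training transform includes random augmentation, the validation subset will receive it too unless the two are built from separately constructed datasets.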
answered Oct 02 '22 22:10 by lhk