The torchvision package provides easy access to commonly used datasets. You would use them like this:
import torch
import torchvision
import torchvision.transforms as transforms

# Any transform works here; ToTensor() is just a minimal placeholder.
transform = transforms.ToTensor()

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
Apparently, you can only switch between train=True and train=False. The docs explain:

train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.
But this goes against the common practice of having a three-way split. For serious work, I need another DataLoader with a validation set. Also, it would be nice to specify the split proportions myself. The docs don't say what percentage of the dataset is reserved for testing, and maybe I would like to change that.

I assume this is a conscious design decision: everyone working on one of these datasets is supposed to use the same test set, which makes results comparable. But I still need to get a validation set out of the trainloader. Is it possible to split a DataLoader into two separate streams of data?
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between train, validation, and test sets depends on factors such as the use case, the structure of the model, the dimensionality of the data, and so on.
Split the dataset: we can use train_test_split from scikit-learn to first split the original dataset into train and test sets, and then apply the same function to the train set to carve out the validation set. In the function below, the test set size is the fraction of the original data we want to reserve for the test set.
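A minimal sketch of what such a function might look like, assuming scikit-learn is installed; the helper name train_val_test_split, its default proportions, and the seed are illustrative choices, not part of the original answer. It splits indices rather than tensors, so it works with any PyTorch Dataset via Subset.

import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

def train_val_test_split(dataset, test_size=0.1, val_size=0.1, seed=42):
    # Hypothetical helper: split a Dataset's indices into train/val/test
    # and wrap each index list in a Subset.
    indices = list(range(len(dataset)))
    # First carve off the test set from the full index list.
    train_val_idx, test_idx = train_test_split(
        indices, test_size=test_size, random_state=seed)
    # val_size is given relative to the original dataset, so rescale it
    # before splitting the remaining indices into train and validation.
    relative_val_size = val_size / (1.0 - test_size)
    train_idx, val_idx = train_test_split(
        train_val_idx, test_size=relative_val_size, random_state=seed)
    return (Subset(dataset, train_idx),
            Subset(dataset, val_idx),
            Subset(dataset, test_idx))

Each returned Subset can then be wrapped in its own DataLoader.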
The motivation is quite simple: you should separate your data into train, validation, and test splits so that you can detect overfitting and evaluate your model accurately.
Meanwhile, I stumbled upon the function random_split. So you don't split the DataLoader, you split the Dataset:

torch.utils.data.random_split(dataset, lengths)
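Applied to the CIFAR10 training set from the snippet above, that might look like the sketch below. The 45,000/5,000 split and the seed are arbitrary choices of mine, and the generator argument assumes a reasonably recent PyTorch version.

# Split the 50,000 CIFAR10 training images into 45,000 train / 5,000 val.
# The generator is only there to make the split reproducible.
train_subset, val_subset = torch.utils.data.random_split(
    trainset, [45000, 5000],
    generator=torch.Generator().manual_seed(42))

trainloader = torch.utils.data.DataLoader(train_subset, batch_size=4,
                                          shuffle=True, num_workers=2)
valloader = torch.utils.data.DataLoader(val_subset, batch_size=4,
                                        shuffle=False, num_workers=2)

The official test set stays untouched, so results remain comparable with other work on the same dataset.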