I want to apply cross-validation in PyTorch using skorch. I prepared my model and my TensorDataset, which returns (image, caption, caption_length), so the dataset already contains both X and y and I cannot pass y separately to the method:
net.fit(dataset)
But when I tried that, I got an error:
ValueError: Stratified CV requires explicitly passing a suitable y
Here's part of my code:
import time
from torch import nn, optim
from skorch import NeuralNetClassifier

start = time.time()
net = NeuralNetClassifier(
    decoder,
    criterion=nn.CrossEntropyLoss,
    max_epochs=args.epochs,
    lr=args.lr,
    optimizer=optim.SGD,
    device='cuda',  # train with CUDA
)
net.fit(dataset, y=None)
end = time.time()
In PyTorch, we have the concept of a Dataset and a DataLoader. The former is purely the container of the data and only needs to implement __len__() and __getitem__(<int>). The latter does the heavy lifting, such as sampling, shuffling, and distributed processing.
skorch uses the PyTorch DataLoaders by default. skorch supports PyTorch's Dataset when calling fit() or partial_fit(). Details on how to use PyTorch's Dataset with skorch can be found in "How do I use a PyTorch Dataset with skorch?". In order to support other data formats, we provide our own Dataset class that is compatible with: numpy.ndarrays
Creating a PyTorch Dataset and managing it with a DataLoader keeps your data manageable and helps to simplify your machine learning pipeline. A Dataset stores all your data, and a DataLoader can be used to iterate through the data, manage batches, transform the data, and much more. Pandas is not essential to create a Dataset object.
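As a minimal sketch of the Dataset/DataLoader split described above, the toy example below defines a map-style dataset and lets a DataLoader batch it. The class name ParityDataset and the data are made up for illustration and are not part of the question's code.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ParityDataset(Dataset):
    """Hypothetical toy dataset: x is the index, y is its parity."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of samples in the dataset.
        return self.n

    def __getitem__(self, idx):
        # Return one (input, target) pair for the given integer index.
        x = torch.tensor([float(idx)])
        y = torch.tensor(idx % 2)
        return x, y


dataset = ParityDataset(10)
# The DataLoader handles batching; shuffling is disabled here.
loader = DataLoader(dataset, batch_size=4, shuffle=False)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(batch_shapes)
```

The dataset only answers "how many items?" and "what is item i?"; everything about batching and ordering lives in the DataLoader.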
The Dataset from skorch makes the assumption that you always have an X and a y, where X represents the input data and y the target. However, you may leave y=None, in which case Dataset returns a dummy variable.
You are (implicitly) using the internal CV split of skorch, which is a stratified split in the case of NeuralNetClassifier and therefore needs information about the labels beforehand. When you pass X and y to fit separately, this works fine since y is accessible at all times. The problem is that you are using a torch.utils.data.Dataset, which is lazy and does not give you direct access to y, hence the error.
Your options are the following.
Set train_split=None to disable the internal CV split:
net = NeuralNetClassifier(
    # module and other arguments omitted
    train_split=None,
)
You will lose internal validation and, as such, features like early stopping.
Split your dataset into two datasets, dataset_train and dataset_valid, then use skorch.helper.predefined_split:
net = NeuralNetClassifier(
    # module and other arguments omitted
    train_split=predefined_split(dataset_valid),
)
You lose nothing, but depending on your data this might be complicated.
Extract your y and pass it to fit:
y_train = np.array([y for X, y in iter(my_dataset)])
net.fit(my_dataset, y=y_train)
This only works if your y fits into memory. Since you are using TensorDataset, you can also read y directly from its tensors attribute (assuming the targets are the second tensor):
y_train = my_dataset.tensors[1]
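To illustrate the extraction step on its own, here is a sketch with a toy two-tensor TensorDataset (inputs first, targets second, which is an assumption about the layout). torch.utils.data.TensorDataset keeps the tensors it was built from in its tensors attribute, so the targets can be read without iterating.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Toy data: 6 samples, 4 features, integer class labels.
X = torch.randn(6, 4)
y = torch.tensor([0, 1, 0, 1, 2, 2])
my_dataset = TensorDataset(X, y)

# Generic approach: index every item and collect the target.
y_iter = np.array([int(my_dataset[i][1]) for i in range(len(my_dataset))])

# TensorDataset shortcut: the targets are the second stored tensor here.
y_direct = my_dataset.tensors[1].numpy()

print(y_iter, y_direct)
# Either array can then be passed as net.fit(my_dataset, y=...).
```

For a dataset whose items are longer tuples, such as (image, caption, caption_length), the index of the target in each item has to be adjusted accordingly.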