Passing TensorDataset or DataLoader to skorch

I want to apply cross-validation in PyTorch using skorch. I prepared my model and a TensorDataset that returns (image, caption, captions_length), so the dataset already contains both X and y, and I cannot pass y separately when calling

net.fit(dataset)

When I tried that, I got an error:

ValueError: Stratified CV requires explicitly passing a suitable y

Here's part of my code:

start = time.time()
net = NeuralNetClassifier(
    decoder,
    criterion=nn.CrossEntropyLoss,
    max_epochs=args.epochs,
    lr=args.lr,
    optimizer=optim.SGD,
    device='cuda',  # train on the GPU; use 'cpu' if CUDA is unavailable
)
net.fit(dataset, y=None)  # this call raises the ValueError above
end = time.time()
Omar Abdelaziz asked Jun 07 '19 08:06





1 Answer

You are (implicitly) using skorch's internal CV split. For NeuralNetClassifier this split is stratified, so it needs to know the labels beforehand.

When X and y are passed to fit separately, this works fine because y is accessible at all times. The problem is that you are using a torch.utils.data.Dataset, which is lazy and does not give direct access to y, hence the error.

Your options are the following.

Set train_split=None to disable the internal CV split

net = NeuralNetClassifier(
    decoder,  # your module, as in the question
    train_split=None,  # no internal train/validation split
)

You will lose internal validation and, as such, features like early stopping.

Split your data beforehand

Split your dataset into two datasets, dataset_train and dataset_valid, then use skorch.helper.predefined_split:

from skorch.helper import predefined_split

net = NeuralNetClassifier(
    decoder,  # your module, as in the question
    train_split=predefined_split(dataset_valid),
)

You lose nothing, but depending on your data, splitting it yourself might be complicated.

Extract your y and pass it to fit

import numpy as np

y_train = np.array([y for _, y in my_dataset])  # iterates the whole dataset once
net.fit(my_dataset, y=y_train)

This only works if your y fits into memory. Since you are using TensorDataset, you can also read y directly from the tensors it wraps (assuming here that the target is the second tensor and lives on the CPU):

y_train = my_dataset.tensors[1].numpy()
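
Both ways of extracting y can be compared on toy data. This is a minimal sketch, assuming a two-tensor TensorDataset whose second tensor holds the targets; a dataset with more fields, like the one in the question, needs the matching index instead:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Toy stand-in for the real data: 10 samples, 4 features, binary targets.
X = torch.randn(10, 4)
y = torch.tensor([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
my_dataset = TensorDataset(X, y)

# Generic approach: iterate once and collect the target of every sample.
y_iter = np.array([int(target) for _, target in my_dataset])

# TensorDataset shortcut: the wrapped tensors are kept in `.tensors`.
y_direct = my_dataset.tensors[1].numpy()

print(np.array_equal(y_iter, y_direct))
```

Either array can then be passed as y in net.fit(my_dataset, y=...).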
nemo answered Oct 03 '22 00:10
