
Pytorch - Concatenating Datasets before using Dataloader

I am trying to load two datasets and use them both for training.

Package versions: python 3.7; pytorch 1.3.1

It is possible to create the data loaders separately and train on them sequentially:

from torch.utils.data import DataLoader, ConcatDataset
from tqdm import tqdm

train_loader_modelnet = DataLoader(
    ModelNet(args.modelnet_root, categories=args.modelnet_categories,
             split='train', transform=transform_modelnet, device=args.device),
    batch_size=args.batch_size, shuffle=True)

train_loader_mydata = DataLoader(
    MyDataset(args.customdata_root, categories=args.mydata_categories,
              split='train', device=args.device),
    batch_size=args.batch_size, shuffle=True)

for e in range(args.epochs):
    for idx, batch in enumerate(tqdm(train_loader_modelnet)):
        pass  # training on dataset1
    for idx, batch in enumerate(tqdm(train_loader_mydata)):
        pass  # training on dataset2

Note: MyDataset is a custom dataset class that implements __len__(self) and __getitem__(self, index). Since the above configuration works, the implementation seems to be OK.
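For reference, a map-style dataset of this kind can be sketched roughly as follows. This is only a minimal illustration: ToyPointClouds, its parameters, and the random data are hypothetical stand-ins, not the actual MyDataset.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyPointClouds(Dataset):
    """Hypothetical stand-in for a custom dataset: random point clouds with integer labels."""

    def __init__(self, num_samples=8, num_points=16, num_classes=3):
        self.points = torch.randn(num_samples, num_points, 3)
        self.labels = torch.randint(0, num_classes, (num_samples,))

    def __len__(self):
        # Number of samples the DataLoader can index into.
        return self.points.shape[0]

    def __getitem__(self, index):
        # Always return the same structure: (Tensor, Tensor).
        return self.points[index], self.labels[index]


loader = DataLoader(ToyPointClouds(), batch_size=4, shuffle=True)
points, labels = next(iter(loader))
```

The point of the sketch is that __getitem__ returns the same structure for every index, which is what the default collate function relies on when it stacks samples into a batch.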

But I would ideally like to combine them into a single dataloader object. I attempted this as per the pytorch documentation:

train_modelnet = ModelNet(args.modelnet_root, categories=args.modelnet_categories,
                          split='train', transform=transform_modelnet, device=args.device)
train_mydata = MyDataset(args.customdata_root, categories=args.mydata_categories,
                         split='train', device=args.device)
train_loader = torch.utils.data.ConcatDataset(train_modelnet, train_mydata)

for e in range(args.epochs):
    for idx, batch in enumerate(tqdm(train_loader)):
        pass  # training on combined

However, on random batches I get an error of the form 'expected Tensor as element X in argument 0, but got tuple'. Any help would be much appreciated!

>   40%|████      | 53/131 [01:03<02:00,  1.55s/it]
> Traceback (most recent call last):
>   File "/home/chris/Programs/pycharm-anaconda-2019.3.4/plugins/python/helpers/pydev/pydevd.py", line 1434, in _exec
>     pydev_imports.execfile(file, globals, locals)  # execute the script
>   File "/home/chris/Programs/pycharm-anaconda-2019.3.4/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
>     exec(compile(contents+"\n", file, 'exec'), glob, loc)
>   File "/home/chris/Documents/4yp/Data/my_kaolin/Classification/pointcloud_classification_combinedset.py", line 83, in <module>
>     for idx, batch in enumerate(tqdm(train_loader)):
>   File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
>     for obj in iterable:
>   File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 346, in __next__
>     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
>   File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
>     return self.collate_fn(data)
>   File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
>     return [default_collate(samples) for samples in transposed]
>   File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
>     return [default_collate(samples) for samples in transposed]
>   File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
>     return torch.stack(batch, 0, out=out)
> TypeError: expected Tensor as element 3 in argument 0, but got tuple

chrispduck asked Dec 11 '25 22:12


2 Answers

If I got your question right, you have train and dev sets (and their corresponding loaders) as follows:

train_set = CustomDataset(...)
train_loader = DataLoader(dataset=train_set, ...)
dev_set = CustomDataset(...)
dev_loader = DataLoader(dataset=dev_set, ...)

And you want to concatenate them in order to use train+dev as the training data, right? If so, you can simply call:

train_dev_sets = torch.utils.data.ConcatDataset([train_set, dev_set])
train_dev_loader = DataLoader(dataset=train_dev_sets, ...)

The train_dev_loader is the loader containing data from both sets.

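Make sure both datasets return samples with the same structure: the same tensor shapes and types, the same number of features, the same label format, and so on. Otherwise the default collate function cannot stack samples from the two sets into one batch.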
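As a minimal, self-contained sketch of the above, assuming two toy TensorDataset objects in place of the real datasets (the sizes and the batch size are made up for illustration):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Two toy datasets with identical per-sample structure:
# a float feature vector of length 3 and an integer label.
train_set = TensorDataset(torch.randn(10, 3), torch.zeros(10, dtype=torch.long))
dev_set = TensorDataset(torch.randn(6, 3), torch.ones(6, dtype=torch.long))

# ConcatDataset takes a single iterable of datasets, not separate arguments.
train_dev_sets = ConcatDataset([train_set, dev_set])
train_dev_loader = DataLoader(train_dev_sets, batch_size=4, shuffle=True)

total = 0
for features, labels in train_dev_loader:
    # With shuffle=True a batch can mix samples from both datasets.
    total += features.shape[0]
```

Note that the datasets are passed as one list, and that the result is still a dataset, so it has to be wrapped in a DataLoader before batching.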

jvel07 answered Dec 14 '25 12:12


I'd guess the two datasets are sometimes returning different types. When the data are Tensors, torch stacks them, so they must all have the same shape. If they're something like strings, torch will make a tuple out of them. So this sounds like one of your datasets is sometimes returning something that's not a tensor, and the error only shows up when a shuffled batch happens to mix the two kinds of sample. I'd put some asserts on the output of your dataset to check that it's doing what you want, or dive in with pdb.
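One way to do that kind of check is sketched below. It is only an illustration: describe, find_mismatches, and the two toy dataset classes are hypothetical helpers written for this example, not part of PyTorch.

```python
import torch
from torch.utils.data import ConcatDataset, Dataset


def describe(sample):
    """Summarise one sample as (type name, shape or None) for each field."""
    fields = sample if isinstance(sample, (tuple, list)) else (sample,)
    return tuple(
        (type(f).__name__, tuple(f.shape) if isinstance(f, torch.Tensor) else None)
        for f in fields
    )


def find_mismatches(dataset):
    """Return (index, description) for samples whose structure differs from sample 0."""
    reference = describe(dataset[0])
    mismatches = []
    for i in range(len(dataset)):
        description = describe(dataset[i])
        if description != reference:
            mismatches.append((i, description))
    return mismatches


class TensorLabels(Dataset):
    """Toy dataset whose label is a tensor."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        return torch.zeros(3), torch.tensor(0)


class TupleLabels(Dataset):
    """Toy dataset whose label is a tuple, which cannot be stacked with tensors."""

    def __len__(self):
        return 2

    def __getitem__(self, index):
        return torch.zeros(3), (1, "chair")


combined = ConcatDataset([TensorLabels(), TupleLabels()])
mismatches = find_mismatches(combined)
```

Running the check on the concatenated dataset before training points at the offending indices directly, instead of waiting for a shuffled batch to fail partway through an epoch.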

Leopd answered Dec 14 '25 12:12