 

Pytorch DataLoader multiple data source

I am trying to use the PyTorch DataLoader to define my own dataset, but I am not sure how to load data from multiple sources:

My current code:

import json
import torch
from torch.utils.data import Dataset

class MultipleSourceDataSet(Dataset):
    def __init__(self, json_file, root_dir, transform=None):
        with open(root_dir + 'block0.json') as f:
            self.result = torch.Tensor(json.load(f))

        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.result[0])

    def __getitem__(self, idx):
        return None  # not implemented yet

The data source is 50 blocks under root_dir = ~/Documents/blocks/

I split the data into blocks and avoided combining them beforehand, since this is a very large dataset.

How can I load them into a single dataloader?

asked Dec 24 '22 by sealpuppy

2 Answers

For a DataLoader you need a single Dataset; your problem is that you have multiple json files and only know how to create a Dataset from each json file separately.
What you can do in this case is use ConcatDataset, which contains all the single-json datasets you create:

import os
import json
import torch
import torch.utils.data as data

class SingleJsonDataset(data.Dataset):
    # a minimal single-json dataset, mirroring the code from the question
    def __init__(self, json_file, root_dir, transform=None):
        with open(os.path.join(root_dir, json_file)) as f:
            self.result = torch.Tensor(json.load(f))
        self.transform = transform

    def __len__(self):
        return len(self.result)

    def __getitem__(self, idx):
        sample = self.result[idx]
        return self.transform(sample) if self.transform else sample

list_of_datasets = []
for j in os.listdir(root_dir):  # note: os.listdir, not os.path.listdir
    if not j.endswith('.json'):
        continue  # skip non-json files
    list_of_datasets.append(SingleJsonDataset(json_file=j, root_dir=root_dir, transform=None))
# once all single-json datasets are created you can concat them into a single one:
multiple_json_dataset = data.ConcatDataset(list_of_datasets)

Now you can feed the concatenated dataset into data.DataLoader.
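For illustration, here is a minimal, self-contained sketch of that last step, using two toy TensorDatasets to stand in for the per-json datasets (the batch size and shuffle flag are arbitrary choices, not something from the question):

```python
import torch
import torch.utils.data as data

# two toy datasets standing in for the single-json datasets
ds1 = data.TensorDataset(torch.arange(10).float())
ds2 = data.TensorDataset(torch.arange(10, 25).float())

combined = data.ConcatDataset([ds1, ds2])
loader = data.DataLoader(combined, batch_size=4, shuffle=True)

print(len(combined))       # 25 samples total across both datasets
for batch, in loader:
    pass                   # each batch holds up to 4 samples
```

ConcatDataset simply chains the datasets end to end, so indexing and shuffling work transparently across all 50 blocks.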

answered Jan 31 '23 by Shai

I should revise my question as 2 different sub-questions:

  1. How to deal with large datasets in PyTorch to avoid memory errors
  2. If I separate a large dataset into small chunks, how can I load multiple mini-datasets?

    For question 1:

    PyTorch DataLoader can prevent this issue by creating mini-batches. Here you can find further explanations.

    For question 2:

    Please refer to Shai's answer above.
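The mini-batch behaviour mentioned for question 1 can be sketched as follows: the DataLoader yields batch_size samples per iteration rather than the whole dataset at once (the sizes here are purely illustrative):

```python
import torch
import torch.utils.data as data

big = data.TensorDataset(torch.zeros(1000, 8))  # stand-in for a large dataset
loader = data.DataLoader(big, batch_size=32)

first_batch, = next(iter(loader))
print(first_batch.shape)  # torch.Size([32, 8]) -- one mini-batch at a time
```

With a Dataset whose `__getitem__` reads samples lazily (e.g. from disk), this means only one mini-batch needs to be materialized in memory per training step.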

answered Jan 31 '23 by sealpuppy