 

How to handle large JSON file in Pytorch?

I am working on a time series problem. The training time series data is stored in a large JSON file of about 30GB. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?

asked Oct 20 '25 by Shamane Siriwardhana

1 Answer

I suppose IterableDataset (docs) is what you need, because:

  1. you probably want to traverse the files sequentially, without random access;
  2. the number of samples in the JSON files is not pre-computed.

I've made a minimal usage example under the assumption that every line of the dataset file is a JSON object itself, but you can change the logic.

import json
from torch.utils.data import DataLoader, IterableDataset


class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        # stream the files one line at a time; no random access
        # and no need to know the number of samples up front
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...

...

dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    y = model(batch)
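
One caveat the example above doesn't cover: with an IterableDataset, every DataLoader worker runs the same __iter__, so num_workers > 1 would yield each sample once per worker. A minimal sketch of per-worker sharding with torch.utils.data.get_worker_info(), assuming the same JSON Lines layout and the same 'x'/'time' keys as above:

import json
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedJsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        files = self.files
        worker = get_worker_info()  # None in the main process
        if worker is not None:
            # split the file list across workers: worker k reads
            # every num_workers-th file starting at offset k
            files = files[worker.id::worker.num_workers]
        for json_file in files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time']

dataloader = DataLoader(ShardedJsonDataset(['data/1.json', 'data/2.json']),
                        batch_size=32, num_workers=4)

This shards at file granularity, so it only balances the load well when the files are of roughly similar size.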
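
If the 30GB file is instead a single JSON document (say, one top-level array of records) rather than one object per line, the same pattern works with an incremental parser. A sketch using the third-party ijson library (pip install ijson), again assuming a top-level array and the same placeholder 'x'/'time' keys:

import ijson  # incremental parser: streams records without loading the whole file
from torch.utils.data import IterableDataset


class StreamingJsonDataset(IterableDataset):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, 'rb') as f:
            # the 'item' prefix selects each element of the top-level
            # array, parsed one at a time in roughly constant memory
            for sample in ijson.items(f, 'item'):
                yield sample['x'], sample['time']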
answered Oct 21 '25 by roman

