 

How to handle large JSON file in Pytorch?

I am working on a time series problem. The training time series data is stored in a large JSON file of about 30GB. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?

asked Oct 20 '25 by Shamane Siriwardhana

1 Answer

I suppose IterableDataset (docs) is what you need, because:

  1. you probably want to traverse the files sequentially, without random access;
  2. the number of samples in the JSON files is not pre-computed.

I've made a minimal usage example under the assumption that every line of the dataset file is a JSON object itself, but you can change the logic.

import json
from torch.utils.data import DataLoader, IterableDataset


class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        # stream the files one line at a time; no random access
        # and no need to know the number of samples up front
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...

...

dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    y = model(batch)
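
One caveat the example above doesn't cover: with an IterableDataset, every DataLoader worker runs the same __iter__, so num_workers > 1 would yield each sample once per worker. A minimal sketch of per-worker sharding with torch.utils.data.get_worker_info(), assuming the same JSON Lines layout and the same 'x'/'time' keys as above:

import json
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedJsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        files = self.files
        worker = get_worker_info()  # None in the main process
        if worker is not None:
            # split the file list across workers: worker k reads
            # every num_workers-th file starting at offset k
            files = files[worker.id::worker.num_workers]
        for json_file in files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time']

dataloader = DataLoader(ShardedJsonDataset(['data/1.json', 'data/2.json']),
                        batch_size=32, num_workers=4)

This shards at file granularity, so it only balances the load well when the files are of roughly similar size.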
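
If the 30GB file is instead a single JSON document (say, one top-level array of records) rather than one object per line, the same pattern works with an incremental parser. A sketch using the third-party ijson library (pip install ijson), again assuming a top-level array and the same placeholder 'x'/'time' keys:

import ijson  # incremental parser: streams records without loading the whole file
from torch.utils.data import IterableDataset


class StreamingJsonDataset(IterableDataset):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, 'rb') as f:
            # the 'item' prefix selects each element of the top-level
            # array, parsed one at a time in roughly constant memory
            for sample in ijson.items(f, 'item'):
                yield sample['x'], sample['time']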
answered Oct 21 '25 by roman

