 

Fastest way to combine many JSON files that don't fit in memory

I have about 100,000 JSON files with this structure:

{'images': [<list of dicts>],
 'annotations': [<list of dicts>],
 'videos': [<list of dicts>]}

Each JSON file varies in size, but the average is about 2MB. I have a lot of RAM (488GB), but I still only seem to fit about 70% of these into memory.

What would be the fastest way in Python to combine these into a single JSON file (with the same three keys, where the lists are combined into single large lists)?


I considered looping through all of them three times (one pass per key) and appending to an output file as I go, but I suspect that would be very slow, and I'm unsure whether there's a better way.
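Sketched out, the three-pass version I have in mind would look something like this (untested; it streams each key's records straight into the output file, so the combined lists never sit in RAM, at the cost of parsing every file three times):

import json
from glob import glob

files = sorted(glob('coco_parts/*.json'))

with open('combined.json', 'w') as out:
    out.write('{')
    for i, key in enumerate(('images', 'annotations', 'videos')):
        if i:
            out.write(',')
        out.write(json.dumps(key) + ':[')
        first = True
        for fp in files:
            with open(fp) as f:
                single = json.load(f)  # one file at a time is cheap
            for item in single[key]:
                if not first:
                    out.write(',')
                json.dump(item, out)
                first = False
        out.write(']')
    out.write('}')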

Here's how I attempted to load them all at once (which slows down and then fails before completing):

from glob import glob
import json
from tqdm import tqdm

# Combined lists for all files, accumulated in memory
full = {
    'videos': [],
    'images': [],
    'annotations': []
}

for fp in tqdm(glob('coco_parts/*.json')):
    with open(fp, 'r') as f:
        single = json.load(f)
    # Extend each combined list with this file's records
    full['videos'] += single['videos']
    full['images'] += single['images']
    full['annotations'] += single['annotations']
Austin asked Aug 30 '25 17:08

1 Answer

I don't have enough reputation to comment, so I will leave this here as an answer.

The fact that you can't keep all of these files in memory, even though the raw data (roughly 100,000 × 2MB ≈ 200GB) should fit comfortably in your 488GB of RAM, is most likely due to the overhead added by the Python objects you're building:

{'images': [<list of dicts>],
 'annotations': [<list of dicts>],
 'videos': [<list of dicts>]}
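You can get a feel for the gap between raw JSON text and the equivalent Python objects with something like this (the record fields here are made up, purely for illustration):

import json
import sys

# A hypothetical COCO-style record, just for illustration
record = {'id': 1, 'file_name': 'frame_000001.jpg', 'width': 1920, 'height': 1080}

print(len(json.dumps(record).encode()))  # size as raw JSON text, in bytes
print(sys.getsizeof(record))             # size of the dict object alone, in bytes,
                                         # not even counting its keys and values

Every dict, list, string and int carries its own object header and allocation overhead, and multiplied across millions of annotation records that easily outgrows the ~200GB of raw text.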

One idea would be to avoid building millions of small Python objects at all: stream the data into a single string/file while preserving the correct JSON structure, or load the records into pandas/numpy, which store them far more compactly, as these articles suggest: article, article.
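For example, one way to realize the streaming idea is a single-pass sketch like the following (untested; the part-file names are my own invention): each input is parsed once, its lists are appended to three part files on disk, and the parts are stitched into one valid JSON document at the end, so the combined lists never exist in memory.

import json
import shutil
from glob import glob

keys = ('images', 'annotations', 'videos')
counts = {k: 0 for k in keys}

# One part file per key; records are written as they are read,
# comma-separated, so each part is the body of a JSON array.
parts = {k: open(f'{k}.part', 'w') for k in keys}
try:
    for fp in sorted(glob('coco_parts/*.json')):
        with open(fp) as f:
            single = json.load(f)  # each file fits in memory on its own
        for k in keys:
            for item in single[k]:
                if counts[k]:
                    parts[k].write(',')
                json.dump(item, parts[k])
                counts[k] += 1
finally:
    for handle in parts.values():
        handle.close()

# Stitch the three parts into one valid JSON document.
with open('combined.json', 'w') as out:
    out.write('{')
    for i, k in enumerate(keys):
        if i:
            out.write(',')
        out.write(json.dumps(k) + ':[')
        with open(f'{k}.part') as part:
            shutil.copyfileobj(part, out)
        out.write(']')
    out.write('}')

Parsing each input only once avoids the 3x slowdown you were worried about, and peak memory stays at roughly the size of the largest single file.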

Charbel abi daher answered Sep 02 '25 06:09