I have about 100,000 JSON files with this structure:
{'images': [<list of dicts>],
'annotations': [<list of dicts>],
'videos': [<list of dicts>]}
Each JSON file varies in size, but the average is about 2 MB. I have a lot of RAM (488 GB), but I can still only fit about 70% of them into memory.
What would be the fastest way in Python to combine these into a single JSON file (with the same three keys, where the lists are combined into single large lists)?
I considered looping through all of them three times (one pass per key) and appending to the output file, but that would be very slow, and I'm not sure whether there is a better way.
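Roughly, the three-pass idea would look something like the sketch below (untested; it assumes every file contains all three keys, and the output path is just an example):

from glob import glob
import json

files = sorted(glob('coco_parts/*.json'))

with open('combined.json', 'w') as out:  # example output path
    out.write('{')
    for ki, key in enumerate(['images', 'annotations', 'videos']):
        if ki:
            out.write(', ')
        out.write(json.dumps(key) + ': [')
        first = True
        # One full pass over all 100k files per key: memory stays flat,
        # but every file gets parsed three times.
        for fp in files:
            with open(fp, 'r') as f:
                single = json.load(f)
            for item in single[key]:
                if not first:
                    out.write(', ')
                out.write(json.dumps(item))
                first = False
        out.write(']')
    out.write('}')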
Here's how I attempted to load them all at once (which slows down and then fails before completing):
from glob import glob
import json
from tqdm import tqdm

# Accumulate everything into one big in-memory dict (this is what eventually
# slows down and runs out of RAM).
full = {
    'videos': [],
    'images': [],
    'annotations': []
}

for fp in tqdm(glob('coco_parts/*.json')):
    with open(fp, 'r') as f:
        single = json.load(f)
    full['videos'] += single['videos']
    full['images'] += single['images']
    full['annotations'] += single['annotations']
I don't have enough reputation to comment, so I will leave this here as an answer.
The fact that you can't hold all of these files in memory, even though that shouldn't be a problem on your machine, is likely due to the overhead added by the Python objects you're building:
{'images': [<list of dicts>],
'annotations': [<list of dicts>],
'videos': [<list of dicts>]}
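For example, you can get a rough feel for this overhead by comparing the raw JSON text of one file with the parsed objects (the file name is just a placeholder, and sys.getsizeof only counts the top-level container, so the real footprint is even larger):

import json
import sys

with open('coco_parts/example.json', 'r') as f:  # placeholder file name
    raw = f.read()

parsed = json.loads(raw)

print('raw text (bytes):         ', len(raw))
print('Python str object (bytes):', sys.getsizeof(raw))
# sys.getsizeof on the dict does NOT include the nested lists, dicts and
# strings, so the true in-memory size of `parsed` is far larger still.
print('top-level dict only (bytes):', sys.getsizeof(parsed))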
One idea is to switch to a more compact representation: keep the data as a single string (while preserving the correct JSON structure), or load it into pandas/numpy, both of which store the same data with far less per-object overhead than nested dicts and lists.
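A minimal sketch of the single-string idea, assuming every file contains all three keys and that the combined output should again be one JSON object (file names are just examples):

import json
from glob import glob

# Keep each key's content as compact JSON text fragments instead of
# millions of small Python dicts: one string per file per key.
parts = {'images': [], 'annotations': [], 'videos': []}

for fp in glob('coco_parts/*.json'):
    with open(fp, 'r') as f:
        single = json.load(f)
    for key in parts:
        if single[key]:
            # json.dumps(...)[1:-1] strips the surrounding '[' and ']'
            # so the fragments can later be joined with commas.
            parts[key].append(json.dumps(single[key])[1:-1])
    del single  # the parsed objects for this file are no longer needed

with open('combined.json', 'w') as out:  # example output path
    out.write('{')
    for ki, (key, fragments) in enumerate(parts.items()):
        if ki:
            out.write(', ')
        out.write(json.dumps(key) + ': [')
        for fi, frag in enumerate(fragments):
            if fi:
                out.write(', ')
            out.write(frag)
        out.write(']')
    out.write('}')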