
How to parse a large JSON file efficiently in Python?

Tags: python, json

I have a file that contains an array of JSON objects. The file is over 1GB, so I can't load it into memory all at once. I need to parse each of the individual objects. I tried using ijson, but that will load the entire array as one object, effectively doing the same thing as a simple json.load() would.

Is there another way to do it?

Edit: Never mind, just use ijson.items() and set the prefix parameter to "item", which yields the elements of the top-level array one at a time.

asked Jun 08 '26 17:06 by Kristína


1 Answer

You can scan the JSON file once to find the position of each level-1 separator, i.e. a comma that belongs directly to the top-level object or array, and then split the file into sections at these positions. For example:

{"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}
        ^      ^            ^        ^             ^
        |      |            |        |             |
     level-2   |         quoted      |          level-2
               |                     |
            level-1               level-1

Here we want to find the level-1 commas, which separate the items contained directly in the top-level object. We can use a generator that scans the JSON stream character by character and keeps track of descending into and stepping out of nested objects and arrays. When it encounters a level-1 comma that is not inside a quoted string, it yields the corresponding position:

def find_sep_pos(stream, *, sep=','):
    level = 0
    quoted = False  # True while inside a JSON string
    backslash = False  # True if the previous character was an escaping backslash
    for pos, char in enumerate(stream):
        if backslash:
            backslash = False  # this character is escaped, so skip it
        elif quoted:
            # Inside a string, brackets and commas are plain text.
            if char == '\\':
                backslash = True
            elif char == '"':
                quoted = False
        elif char == '"':
            quoted = True
        elif char in '{[':
            level += 1
        elif char in ']}':
            level -= 1
        elif char == sep and level == 1:
            yield pos

Used on the example data above, this would give list(find_sep_pos(example)) == [15, 37].
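
As a quick sanity check, the sketch below slices the example string at the reported positions and parses each piece on its own. It repeats a variant of find_sep_pos that ignores brackets inside strings, so that it runs standalone. The helper name split_top_level is made up for this illustration, and it works on an in-memory string rather than a file.

```python
import json


def find_sep_pos(stream, *, sep=','):
    """Yield the positions of separators that belong to the top-level container."""
    level = 0
    quoted = False
    backslash = False
    for pos, char in enumerate(stream):
        if backslash:
            backslash = False  # escaped character inside a string
        elif quoted:
            if char == '\\':
                backslash = True
            elif char == '"':
                quoted = False
        elif char == '"':
            quoted = True
        elif char in '{[':
            level += 1
        elif char in ']}':
            level -= 1
        elif char == sep and level == 1:
            yield pos


def split_top_level(text):
    """Parse each top-level item of `text` separately (hypothetical helper)."""
    opening = text[0]  # assumes no leading whitespace
    closing = {'{': '}', '[': ']'}[opening]
    start = 1
    for pos in find_sep_pos(text):
        # Re-wrap the section in brackets so it is valid JSON by itself.
        yield json.loads(opening + text[start:pos] + closing)
        start = pos + 1
    # The last section still carries the closing bracket of the original text.
    yield json.loads(opening + text[start:])


example = '{"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}'
positions = list(find_sep_pos(example))
parts = list(split_top_level(example))
print(positions)  # [15, 37]
print(parts)
```

Each piece is a one-item object here. With a top-level array, each piece would be a one-element list instead.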

Then we can divide the file into sections that correspond to the separator positions and load each section individually via json.loads:

import itertools as it
import json

with open('example.json') as fh:
    # Iterating over `fh` yields lines, so we chain them in order to get characters.
    sep_pos = tuple(find_sep_pos(it.chain.from_iterable(fh)))
    fh.seek(0)  # reset to the beginning of the file
    stream = it.chain.from_iterable(fh)
    opening_bracket = next(stream)  # assumes the file starts with '{' or '[' (no leading whitespace)
    closing_bracket = {'{': '}', '[': ']'}[opening_bracket]
    offset = 1  # the bracket we just consumed adds an offset of 1
    for pos in sep_pos:
        json_str = (
            opening_bracket
            + ''.join(it.islice(stream, pos - offset))
            + closing_bracket
        )
        obj = json.loads(json_str)  # this is your object
        next(stream)  # step over the separator
        offset = pos + 1  # adjust where we are in the stream right now
        print(obj)
    # The last object still remains in the stream, so we load it here.
    obj = json.loads(opening_bracket + ''.join(stream))
    print(obj)
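
For comparison, here is a rough single-pass sketch of a different technique than the two-pass method above. It reads the file in chunks and decodes one element at a time with json.JSONDecoder.raw_decode from the standard library. It assumes the top-level value is an array, as in the question. The names iter_array_items and chunk_size are chosen for this illustration, and the function does very little validation.

```python
import json

_SKIP = ' \t\r\n,'  # whitespace and item separators between elements


def iter_array_items(fh, chunk_size=65536):
    """Yield the elements of a top-level JSON array from a text-mode file.

    Hypothetical helper: a sketch that tolerates stray commas and does
    not fully validate the input.
    """
    decoder = json.JSONDecoder()
    buf = ''
    while not buf:  # skip leading whitespace up to the opening bracket
        chunk = fh.read(chunk_size)
        if not chunk:
            raise ValueError('empty input')
        buf = chunk.lstrip()
    if buf[0] != '[':
        raise ValueError('expected a top-level JSON array')
    buf = buf[1:]
    while True:
        buf = buf.lstrip(_SKIP)
        if buf.startswith(']'):
            return  # end of the array
        try:
            obj, end = decoder.raw_decode(buf)
            # Only trust the value if a delimiter follows it; otherwise it may
            # be a number cut off at the chunk boundary (e.g. "1" of "1.5").
            complete = end < len(buf) and buf[end] in _SKIP + ']'
        except json.JSONDecodeError:
            complete = False  # element is probably truncated; need more data
        if complete:
            yield obj
            buf = buf[end:]
        else:
            chunk = fh.read(chunk_size)
            if not chunk:
                raise ValueError('malformed or truncated JSON array')
            buf += chunk
```

It could be used as `for obj in iter_array_items(fh): ...` inside a `with open('example.json') as fh:` block. Only the current chunk and the element being decoded are held in memory, but a very large single element is re-parsed each time a new chunk arrives, so this is likely slower than a dedicated streaming parser such as ijson.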
answered Jun 11 '26 07:06 by a_guest


