Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How I can I lazily read multiple JSON values from a file/stream in Python?

I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.

At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.

Example Use

example.py

import my_json as json
import sys

for o in json.iterload(sys.stdin):
    print("Working on a", type(o))

in.txt

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

example session

$ python3.2 example.py < in.txt
Working on a dict
Working on a int
Working on a int
Working on a list
Working on a int
Working on a int
Working on a int
like image 930
Jeremy Avatar asked Jul 30 '11 22:07

Jeremy


4 Answers

JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object per line solution that you're using is seen elsewhere too. Scrapy calls it 'JSON lines':

  • https://docs.scrapy.org/en/latest/topics/exporters.html?highlight=exporters#jsonitemexporter
  • http://www.enricozini.org/2011/tips/python-stream-json/

You can do it slightly more Pythonically:

for jsonline in f:
    yield json.loads(jsonline)   # or do the processing in this loop

I think this is about the best way - it doesn't rely on any third party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.

like image 142
Thomas K Avatar answered Nov 17 '22 09:11

Thomas K


A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic-enough and reasonably well-performing solution, I ended doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you though). To get some kind of performance out of it, it is written as a C module.

Example:

from splitstream import splitfile

for jsonstr in splitfile(sys.stdin, format="json")):
    yield json.loads(jsonstr)
like image 21
Krumelur Avatar answered Nov 17 '22 08:11

Krumelur


Sure you can do this. You just have to take to raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files you can modify it to only read from the file as necessary without much difficulty.

import json
from json.decoder import WHITESPACE

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    if isinstance(string_or_fp, file):
        string = string_or_fp.read()
    else:
        string = str(string_or_fp)

    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, it's a generator.

like image 27
Jeremy Roman Avatar answered Nov 17 '22 08:11

Jeremy Roman


Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is the file must be seekable.

def stream_read_json(fn):
    import json
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj

Edit: just noticed that this will only work for Python >=3.5. For earlier, failures return a ValueError, and you have to parse out the position from the string, e.g.

def stream_read_json(fn):
    import json
    import re
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except ValueError as e:
                f.seek(start_pos)
                end_pos = int(re.match('Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                    e.args[0]).groups()[0])
                json_str = f.read(end_pos)
                obj = json.loads(json_str)
                start_pos += end_pos
                yield obj
like image 26
Nic Watson Avatar answered Nov 17 '22 10:11

Nic Watson