Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JSON-encoding very long iterators

Tags:

python

json

I am writing a web service which returns objects containing very long lists, encoded in JSON. Of course we want to use iterators rather than Python lists so we can stream the objects from a database; unfortunately, the JSON encoder in the standard library (json.JSONEncoder) only accepts lists and tuples to be converted to JSON lists (though _iterencode_list looks like it would actually work on any iterable).

The docstrings suggest overriding default to convert the object to a list, but this means we lose the benefits of streaming. Previously, we overrode a private method, but (as could have been expected) that broke when the encoder was refactored.

What is the best way to serialize iterators as JSON lists in Python in a streaming way?

like image 846
Max Avatar asked Oct 01 '12 09:10

Max


2 Answers

I needed exactly this. First approach was override the JSONEncoder.iterencode() method. However this does not work because as soon as the iterator is not toplevel, the internals of some _iterencode() function take over.

After some studying of the code, I found a very hacky solution, but it works. Python 3 only, but I'm sure the same magic is possible with python 2 (just other magic-method names):

import collections.abc
import json
import itertools
import sys
import resource
import time
starttime = time.time()
lasttime = None


def log_memory():
    if "linux" in sys.platform.lower():
        to_MB = 1024
    else:
        to_MB = 1024 * 1024
    print("Memory: %.1f MB, time since start: %.1f sec%s" % (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / to_MB,
        time.time() - starttime,
        "; since last call: %.1f sec" % (time.time() - lasttime) if lasttime
        else "",
    ))
    globals()["lasttime"] = time.time()


class IterEncoder(json.JSONEncoder):
    """
    JSON Encoder that encodes iterators as well.
    Write directly to file to use minimal memory
    """
    class FakeListIterator(list):
        def __init__(self, iterable):
            self.iterable = iter(iterable)
            try:
                self.firstitem = next(self.iterable)
                self.truthy = True
            except StopIteration:
                self.truthy = False

        def __iter__(self):
            if not self.truthy:
                return iter([])
            return itertools.chain([self.firstitem], self.iterable)

        def __len__(self):
            raise NotImplementedError("Fakelist has no length")

        def __getitem__(self, i):
            raise NotImplementedError("Fakelist has no getitem")

        def __setitem__(self, i):
            raise NotImplementedError("Fakelist has no setitem")

        def __bool__(self):
            return self.truthy

    def default(self, o):
        if isinstance(o, collections.abc.Iterable):
            return type(self).FakeListIterator(o)
        return super().default(o)

print(json.dumps((i for i in range(10)), cls=IterEncoder))
print(json.dumps((i for i in range(0)), cls=IterEncoder))
print(json.dumps({"a": (i for i in range(10))}, cls=IterEncoder))
print(json.dumps({"a": (i for i in range(0))}, cls=IterEncoder))


log_memory()
print("dumping 10M numbers as incrementally")
with open("/dev/null", "wt") as fp:
    json.dump(range(10000000), fp, cls=IterEncoder)
log_memory()
print("dumping 10M numbers built in encoder")
with open("/dev/null", "wt") as fp:
    json.dump(list(range(10000000)), fp)
log_memory()

Results:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
{"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
{"a": []}
Memory: 8.4 MB, time since start: 0.0 sec
dumping 10M numbers as incrementally
Memory: 9.0 MB, time since start: 8.6 sec; since last call: 8.6 sec
dumping 10M numbers built in encoder
Memory: 395.5 MB, time since start: 17.1 sec; since last call: 8.5 sec

It's clear to see that the IterEncoder does not need the memory to store 10M ints, while keeping the same encoding speed.

The (hacky) trick is that the _iterencode_list actually doesn't need any of the list things. It just wants to know if the list is empty (__bool__) and then get its iterator. However it only gets to this code when isinstance(x, (list, tuple)) returns True. So I'm packaging the iterator into a list-subclass, then disabling all the random-access, getting the first element up front so that I know whether it's empty or not, and feeding back the iterator. Then the default method returns this fake list in case of an iterator.

like image 60
Claude Avatar answered Oct 30 '22 14:10

Claude


Save this into a module file and import it or paste it directly into your code.

'''
Copied from Python 2.7.8 json.encoder lib, diff follows:
@@ -331,6 +331,8 @@
                     chunks = _iterencode(value, _current_indent_level)
                 for chunk in chunks:
                     yield chunk
+        if first:
+            yield buf
         if newline_indent is not None:
             _current_indent_level -= 1
             yield '\n' + (' ' * (_indent * _current_indent_level))
@@ -427,12 +429,12 @@
             yield str(o)
         elif isinstance(o, float):
             yield _floatstr(o)
-        elif isinstance(o, (list, tuple)):
-            for chunk in _iterencode_list(o, _current_indent_level):
-                yield chunk
         elif isinstance(o, dict):
             for chunk in _iterencode_dict(o, _current_indent_level):
                 yield chunk
+        elif hasattr(o, '__iter__'):
+            for chunk in _iterencode_list(o, _current_indent_level):
+                yield chunk
         else:
             if markers is not None:
                 markerid = id(o)
'''
from json import encoder

def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
        _key_separator, _item_separator, _sort_keys, _skipkeys, _one_shot,
        ## HACK: hand-optimized bytecode; turn globals into locals
        ValueError=ValueError,
        basestring=basestring,
        dict=dict,
        float=float,
        id=id,
        int=int,
        isinstance=isinstance,
        list=list,
        long=long,
        str=str,
        tuple=tuple,
    ):

    def _iterencode_list(lst, _current_indent_level):
        if not lst:
            yield '[]'
            return
        if markers is not None:
            markerid = id(lst)
            if markerid in markers:
                raise ValueError("Circular reference detected")
            markers[markerid] = lst
        buf = '['
        if _indent is not None:
            _current_indent_level += 1
            newline_indent = '\n' + (' ' * (_indent * _current_indent_level))
            separator = _item_separator + newline_indent
            buf += newline_indent
        else:
            newline_indent = None
            separator = _item_separator
        first = True
        for value in lst:
            if first:
                first = False
            else:
                buf = separator
            if isinstance(value, basestring):
                yield buf + _encoder(value)
            elif value is None:
                yield buf + 'null'
            elif value is True:
                yield buf + 'true'
            elif value is False:
                yield buf + 'false'
            elif isinstance(value, (int, long)):
                yield buf + str(value)
            elif isinstance(value, float):
                yield buf + _floatstr(value)
            else:
                yield buf
                if isinstance(value, (list, tuple)):
                    chunks = _iterencode_list(value, _current_indent_level)
                elif isinstance(value, dict):
                    chunks = _iterencode_dict(value, _current_indent_level)
                else:
                    chunks = _iterencode(value, _current_indent_level)
                for chunk in chunks:
                    yield chunk
        if first:
            yield buf
        if newline_indent is not None:
            _current_indent_level -= 1
            yield '\n' + (' ' * (_indent * _current_indent_level))
        yield ']'
        if markers is not None:
            del markers[markerid]

    def _iterencode_dict(dct, _current_indent_level):
        if not dct:
            yield '{}'
            return
        if markers is not None:
            markerid = id(dct)
            if markerid in markers:
                raise ValueError("Circular reference detected")
            markers[markerid] = dct
        yield '{'
        if _indent is not None:
            _current_indent_level += 1
            newline_indent = '\n' + (' ' * (_indent * _current_indent_level))
            item_separator = _item_separator + newline_indent
            yield newline_indent
        else:
            newline_indent = None
            item_separator = _item_separator
        first = True
        if _sort_keys:
            items = sorted(dct.items(), key=lambda kv: kv[0])
        else:
            items = dct.iteritems()
        for key, value in items:
            if isinstance(key, basestring):
                pass
            # JavaScript is weakly typed for these, so it makes sense to
            # also allow them.  Many encoders seem to do something like this.
            elif isinstance(key, float):
                key = _floatstr(key)
            elif key is True:
                key = 'true'
            elif key is False:
                key = 'false'
            elif key is None:
                key = 'null'
            elif isinstance(key, (int, long)):
                key = str(key)
            elif _skipkeys:
                continue
            else:
                raise TypeError("key " + repr(key) + " is not a string")
            if first:
                first = False
            else:
                yield item_separator
            yield _encoder(key)
            yield _key_separator
            if isinstance(value, basestring):
                yield _encoder(value)
            elif value is None:
                yield 'null'
            elif value is True:
                yield 'true'
            elif value is False:
                yield 'false'
            elif isinstance(value, (int, long)):
                yield str(value)
            elif isinstance(value, float):
                yield _floatstr(value)
            else:
                if isinstance(value, (list, tuple)):
                    chunks = _iterencode_list(value, _current_indent_level)
                elif isinstance(value, dict):
                    chunks = _iterencode_dict(value, _current_indent_level)
                else:
                    chunks = _iterencode(value, _current_indent_level)
                for chunk in chunks:
                    yield chunk
        if newline_indent is not None:
            _current_indent_level -= 1
            yield '\n' + (' ' * (_indent * _current_indent_level))
        yield '}'
        if markers is not None:
            del markers[markerid]

    def _iterencode(o, _current_indent_level):
        if isinstance(o, basestring):
            yield _encoder(o)
        elif o is None:
            yield 'null'
        elif o is True:
            yield 'true'
        elif o is False:
            yield 'false'
        elif isinstance(o, (int, long)):
            yield str(o)
        elif isinstance(o, float):
            yield _floatstr(o)
        elif isinstance(o, dict):
            for chunk in _iterencode_dict(o, _current_indent_level):
                yield chunk
        elif hasattr(o, '__iter__'):
            for chunk in _iterencode_list(o, _current_indent_level):
                yield chunk
        else:
            if markers is not None:
                markerid = id(o)
                if markerid in markers:
                    raise ValueError("Circular reference detected")
                markers[markerid] = o
            o = _default(o)
            for chunk in _iterencode(o, _current_indent_level):
                yield chunk
            if markers is not None:
                del markers[markerid]

    return _iterencode

encoder._make_iterencode = _make_iterencode
like image 30
Zectbumo Avatar answered Oct 30 '22 14:10

Zectbumo