
Deserializing a huge json string to python objects

I am using simplejson to deserialize a JSON string to Python objects. I have a custom-written object_hook that takes care of turning the JSON back into my domain objects.

The problem is that when my JSON string is huge (i.e. the server returns around 800K domain objects as a JSON string), my Python deserializer takes almost 10 minutes.

I drilled down a bit further and it looks like simplejson itself is not doing much work; rather, it delegates everything to the object_hook. I tried optimizing my object_hook, but that did not improve performance much either (I barely got a 1-minute improvement).

My question is: is there another standard framework that is optimized to handle huge data sets, or is there a way I can make use of the framework's own capabilities rather than doing everything at the object_hook level?

I see that without the object_hook the framework returns just a list of dictionaries, not a list of domain objects.

Any pointers here will be useful.

FYI I am using simplejson version 3.7.2

Here is my sample _object_hook:

def _object_hook(dct):
    if '@CLASS' in dct:  # the server marks domain objects with this @CLASS key
        clsname = dct['@CLASS']
        # This is like Class.forName: it imports the module and returns the class
        cls = get_class(clsname)
        # As my server is in Java, I convert the attribute names to the Python naming convention
        dct = dict((convert_java_name_to_python(k), dct[k]) for k in dct.keys())
        if cls != None:
            obj_key = None
            if "@uuid" in dct:
                obj_key = dct["@uuid"]
                del(dct["@uuid"])
            else:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)

            # This I found to be the most time-consuming step: my domain object's
            # __init__ sets all attributes based on the kwargs passed in.
            obj = cls(**dct)
            if obj_key is not None:
                # I keep all uuids along with their objects in the shared_objs
                # dictionary; it is used later to replace references.
                shared_objs[obj_key] = obj
        else:
            warning("class not found: " + clsname)
            obj = dct

        return obj
    else:
        return dct

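For context, the hook is wired up roughly like this (a minimal sketch; response_text is a placeholder name for the raw JSON string returned by the server):

import simplejson

# response_text is assumed to hold the raw JSON string from the server
domain_objects = simplejson.loads(response_text, object_hook=_object_hook)
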
A sample response:

    {"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-9b1778a99c82","val1":"ABC","val2":1131,"val3":1754095,"value4":{"@CLASS":"sample.nestedClass","@UUID":"f7bb298c-fd0b-4d87-bed8-74d5eb1d6517","id":1754095,"name":"XYZ","abbreviation":"ABC"}}

I have many levels of nesting, and the number of records I am receiving from the server is more than 800K.

asked Jun 16 '16 by pragnya


1 Answer

I don't know of any framework that offers what you seek out of the box, but you can apply a few optimizations to the way your class instances are set up.

Since unpacking the dictionary into keyword arguments and assigning them to your class attributes is taking the bulk of the time, you may consider passing dct directly to your class __init__ and setting the instance dictionary self.__dict__ to dct:

Trial 1

In [1]: data = {"name": "yolanda", "age": 4}

In [2]: class Person:
   ...:     def __init__(self, name, age):
   ...:         self.name = name
   ...:         self.age = age
   ...:
In [3]: %%timeit
   ...: Person(**data)
   ...:
1000000 loops, best of 3: 926 ns per loop

Trial 2

In [4]: data = {"name": "yolanda", "age": 4}

In [5]: class Person2:
   ....:     def __init__(self, data):
   ....:         self.__dict__ = data
   ....:
In [6]: %%timeit
   ....: Person2(data)
   ....:
1000000 loops, best of 3: 541 ns per loop

There is no need to worry about self.__dict__ being modified through another reference, since the reference to dct is dropped before _object_hook returns.

This will of course mean changing the setup of your __init__, with the attributes of your class depending strictly on the items in dct. It's up to you.
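
As a rough sketch of what that change could look like (the class name here is illustrative, not one of your actual domain classes):

class Counter(object):
    def __init__(self, dct):
        # Adopt the already-converted dict as the instance dictionary instead of
        # unpacking it into keyword arguments and assigning attributes one by one.
        self.__dict__ = dct

# In _object_hook, construction then becomes:
#     obj = cls(dct)    # instead of obj = cls(**dct)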


You may also replace cls != None with cls is not None (there is only one None object so an identity check is more pythonic):

Trial 1

In [38]: cls = 5
In [39]: %%timeit
   ....: cls != None
   ....:
10000000 loops, best of 3: 85.8 ns per loop

Trial 2

In [40]: %%timeit
   ....: cls is not None
   ....:
10000000 loops, best of 3: 57.8 ns per loop

And you can replace two lines with one:

obj_key = dct["@uuid"]
del(dct["@uuid"])

becoming:

obj_key = dct.pop('@uuid')  # not an optimization, just the same as the two lines above

At the scale of 800K domain objects, these changes should save you a good amount of time by letting the object_hook create your objects more quickly.
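
Putting the suggestions together, the hook could look roughly like this (a sketch based on the code in the question; it assumes your domain classes take the dict directly as shown above, and reuses the question's get_class, convert_java_name_to_python, info, warning and shared_objs helpers):

def _object_hook(dct):
    if '@CLASS' in dct:
        clsname = dct['@CLASS']
        cls = get_class(clsname)
        dct = dict((convert_java_name_to_python(k), dct[k]) for k in dct)
        if cls is not None:                   # identity check instead of !=
            obj_key = dct.pop("@uuid", None)  # fetch and remove in one step
            if obj_key is None:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)
            obj = cls(dct)                    # __init__ assigns dct to self.__dict__
            if obj_key is not None:
                shared_objs[obj_key] = obj
        else:
            warning("class not found: " + clsname)
            obj = dct
        return obj
    return dct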

answered Oct 05 '22 by Moses Koledoye