
How to parse huge JSON file as stream in Json.NET?

Tags: json, c#, json.net

I have a very, very large JSON file (1000+ MB) of identical JSON objects. For example:

[
    {
        "id": 1,
        "value": "hello",
        "another_value": "world",
        "value_obj": {
            "name": "obj1"
        },
        "value_list": [
            1,
            2,
            3
        ]
    },
    {
        "id": 2,
        "value": "foo",
        "another_value": "bar",
        "value_obj": {
            "name": "obj2"
        },
        "value_list": [
            4,
            5,
            6
        ]
    },
    {
        "id": 3,
        "value": "a",
        "another_value": "b",
        "value_obj": {
            "name": "obj3"
        },
        "value_list": [
            7,
            8,
            9
        ]
    },
    ...
]

Every single item in the root JSON list follows the same structure and thus would be individually deserializable. I already have the C# classes written to receive this data, and deserializing a JSON file containing a single object without the list works as expected.

At first, I tried to just directly deserialize my objects in a loop:

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<MyObject>(reader);
    }
}

This didn't work; it threw an exception clearly stating that an object was expected, not a list. My understanding is that this call reads a single object at the root level of the JSON file, and since the root here is a list of objects, the request is invalid.

My next idea was to deserialize as a C# List of objects:

JsonSerializer serializer = new JsonSerializer();
List<MyObject> o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<List<MyObject>>(reader);
    }
}

This does succeed. However, it only somewhat reduces the issue of high RAM usage. In this case it does look like the application is deserializing items one at a time, and so is not reading the entire JSON file into RAM, but we still end up with a lot of RAM usage because the C# List object now contains all of the data from the JSON file in RAM. This has only displaced the problem.

I then decided to simply try taking a single character off the beginning of the stream (to eliminate the [) by doing sr.Read() before going into the loop. The first object then does read successfully, but subsequent ones do not, with an exception of "unexpected token". My guess is this is the comma and space between the objects throwing the reader off.

Simply removing square brackets won't work since the objects do contain a primitive list of their own, as you can see in the sample. Even trying to use }, as a separator won't work since, as you can see, there are sub-objects within the objects.

My goal is to read the objects from the stream one at a time: read an object, do something with it, discard it from RAM, then read the next object, and so on. This would eliminate the need to load either the entire JSON string or the entire contents of the data into RAM as C# objects.

What am I missing?

fdmillion asked May 02 '17 21:05



1 Answer

This should resolve your problem. It works just like your initial code, except that it deserializes an object only when the reader is positioned on a start-object token (the { character in the stream); otherwise it keeps reading until it reaches the next start-object token.

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}
nocodename answered Sep 19 '22 22:09
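
Editor's note: the "read one, process it, discard it" pattern asked for in the question can be packaged as a lazy iterator around the same token check used in the answer. What follows is only a minimal sketch, not part of the original answer. JsonArrayStreamer, ReadItems, and SampleItem are hypothetical names invented for illustration (SampleItem stands in for the asker's MyObject class, whose definition is not shown), and the sketch assumes the root of the input is an array of objects.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical stand-in for the asker's MyObject; only two of the sample fields are mapped.
public class SampleItem
{
    public int Id { get; set; }
    public string Value { get; set; }
}

public static class JsonArrayStreamer
{
    // Hypothetical helper (not a Json.NET API): lazily yields each object
    // found in a root-level JSON array, one at a time.
    public static IEnumerable<T> ReadItems<T>(TextReader text)
    {
        var serializer = new JsonSerializer();
        using (var reader = new JsonTextReader(text))
        {
            while (reader.Read())
            {
                // Deserialize consumes the whole object, nested objects included,
                // so StartObject is only ever observed here for root-array elements.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<T>(reader);
                }
            }
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        // A small in-memory document in place of bigfile.json.
        const string json = "[{\"id\":1,\"value\":\"hello\"},{\"id\":2,\"value\":\"foo\"}]";

        foreach (var item in JsonArrayStreamer.ReadItems<SampleItem>(new StringReader(json)))
        {
            // Each item becomes eligible for garbage collection once the loop moves on.
            Console.WriteLine($"{item.Id}: {item.Value}");
        }
    }
}
```

For a file on disk, passing File.OpenText("bigfile.json") instead of the StringReader should behave the same way. Because the method is an iterator, nothing is read until the foreach starts, and only the current item needs to be held in memory, provided the caller does not collect the results into a list.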