
Generic code to split large JSON based on node name

I have a very large JSON file; the cars array below can contain up to 100,000,000 records. The total file size can vary from 500 MB to 10 GB. I am using Newtonsoft Json.NET.

Input

{
"name": "John",
"age": "30",
"cars": [{
    "brand": "ABC",
    "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
    "year": "2019",
    "month": "1",
    "day": "1"
}, {
    "brand": "XYZ",
    "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
    "year": "2019",
    "month": "10",
    "day": "01"
}],
"TestCity": "TestCityValue",
"TestCity1": "TestCityValue1"}

Desired Output File 1 Json

{
    "name": "John",
    "age": "30",
    "cars": {
        "brand": "ABC",
        "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
        "year": "2019",
        "month": "1",
        "day": "1"
    },
    "TestCity": "TestCityValue",
    "TestCity1": "TestCityValue1"
}

File 2 Json

{
    "name": "John",
    "age": "30",
    "cars": {
        "brand": "XYZ",
        "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
        "year": "2019",
        "month": "10",
        "day": "01"
    },
    "TestCity": "TestCityValue",
    "TestCity1": "TestCityValue1"
}

So I came up with the following code which kinda works

 public static void SplitJson(Uri objUri, string splitbyProperty)
    {
        try
        {
            bool readinside = false;
            HttpClient client = new HttpClient();
            using (Stream stream = client.GetStreamAsync(objUri).Result)
            using (StreamReader streamReader = new StreamReader(stream))
            using (JsonTextReader reader = new JsonTextReader(streamReader))
            {
                Node objnode = new Node();
                while (reader.Read())
                {
                    JObject obj = new JObject(reader);


                    if (reader.TokenType == JsonToken.String && reader.Path.ToString().Contains("name") && !reader.Value.ToString().Equals(reader.Path.ToString()))
                    {
                        objnode.name = reader.Value.ToString();
                    }

                    if (reader.TokenType == JsonToken.Integer && reader.Path.ToString().Contains("age") && !reader.Value.ToString().Equals(reader.Path.ToString()))
                    {
                        objnode.age = reader.Value.ToString();

                    }

                    if (reader.Path.ToString().Contains(splitbyProperty) && reader.TokenType == JsonToken.StartArray)
                    {
                        int counter = 0;
                        while (reader.Read())
                        {
                            if (reader.TokenType == JsonToken.StartObject)
                            {
                                counter = counter + 1;
                                var item = JsonSerializer.Create().Deserialize<Car>(reader);
                                objnode.cars = new List<Car>();
                                objnode.cars.Add(item);
                                insertIntoFileSystem(objnode, counter);
                            }

                            if (reader.TokenType == JsonToken.EndArray)
                                break;
                        }
                    }

                }

            }

        }
        catch (Exception)
        {

            throw;
        }
    }
    public static void insertIntoFileSystem(Node objNode, int counter)
    {

        string fileName = @"C:\Temp\output_" + objNode.name + "_" + objNode.age + "_" + counter + ".json";
        var serialiser = new JsonSerializer();
        using (TextWriter tw = new StreamWriter(fileName))
        {
            using (StringWriter textWriter = new StringWriter())
            {
                serialiser.Serialize(textWriter, objNode);
                tw.WriteLine(textWriter);
            }
        }
    }

ISSUE

  1. Any field after the array is not captured when the file is large. Is there a way to skip over the large array in the JSON, or to process the reader in parallel? In short, I am not able to capture the part below using my code:

    "TestCity": "TestCityValue", "TestCity1": "TestCityValue1"}

Tushar Narang asked Jan 23 '19 12:01




1 Answer

You are going to need to process your large JSON file in two passes to achieve the result you want.

In the first pass, split the file into two: create a file containing just the huge array, and a second file which contains all the other information, which will be used as a template for the individual JSON files you ultimately want to create.

In the second pass, read the template file into memory (I'm assuming this part of the JSON is relatively smallish so this should not be a problem), then use a reader to process the array file one item at a time. For each item, combine it with the template and write it to a separate file.

At the end, you can delete the temporary array and template files.

Here is what it might look like in code:

using System;
using System.IO;
using System.Text;
using System.Net.Http;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static void SplitJson(Uri objUri, string arrayPropertyName)
{
    string templateFileName = @"C:\Temp\template.json";
    string arrayFileName = @"C:\Temp\array.json";

    // Split the original JSON stream into two temporary files:
    // one that has the huge array and one that has everything else
    HttpClient client = new HttpClient();
    using (Stream stream = client.GetStreamAsync(objUri).Result)
    using (JsonReader reader = new JsonTextReader(new StreamReader(stream)))
    using (JsonWriter templateWriter = new JsonTextWriter(new StreamWriter(templateFileName)))
    using (JsonWriter arrayWriter = new JsonTextWriter(new StreamWriter(arrayFileName)))
    {
        if (reader.Read() && reader.TokenType == JsonToken.StartObject)
        {
            templateWriter.WriteStartObject();
            while (reader.Read() && reader.TokenType != JsonToken.EndObject)
            {
                string propertyName = (string)reader.Value;
                reader.Read();
                templateWriter.WritePropertyName(propertyName);
                if (propertyName == arrayPropertyName)
                {
                    arrayWriter.WriteToken(reader);
                    templateWriter.WriteStartObject();  // empty placeholder object
                    templateWriter.WriteEndObject();
                }
                else if (reader.TokenType == JsonToken.StartObject ||
                         reader.TokenType == JsonToken.StartArray)
                {
                    templateWriter.WriteToken(reader);
                }
                else
                {
                    templateWriter.WriteValue(reader.Value);
                }
            }
            templateWriter.WriteEndObject();
        }
    }

    // Now read the huge array file and combine each item in the array
    // with the template to make new files
    JObject template = JObject.Parse(File.ReadAllText(templateFileName));
    using (JsonReader arrayReader = new JsonTextReader(new StreamReader(arrayFileName)))
    {
        int counter = 0;
        while (arrayReader.Read())
        {
            if (arrayReader.TokenType == JsonToken.StartObject)
            {
                counter++;
                JObject item = JObject.Load(arrayReader);
                template[arrayPropertyName] = item;
                string fileName = string.Format(@"C:\Temp\output_{0}_{1}_{2}.json",
                                                template["name"], template["age"], counter);

                File.WriteAllText(fileName, template.ToString());
            }
        }
    }

    // Clean up temporary files
    File.Delete(templateFileName);
    File.Delete(arrayFileName);
}

Note that the above approach requires double the disk space of the original JSON during processing, because of the temporary files. If this is a problem, you can modify the code to download the file twice instead (although this will likely increase the processing time). In the first download, create the template JSON and skip over the array; in the second download, advance to the array and process it with the template as before to create the output files.
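
For illustration, here is a rough sketch of what that two-download variant might look like. It is untested against a real 10 GB file and rests on a few assumptions: `openStream` is a hypothetical factory that returns a fresh stream over the same JSON each time it is called (for example, a new `HttpClient.GetStreamAsync` call), `outputDirectory` is a hypothetical parameter replacing the hard-coded `C:\Temp`, the root of the JSON is an object, and the named property holds an array of objects. The class and method names are made up for this example.

```csharp
using System;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonSplitter
{
    // openStream (hypothetical) must return a new stream over the same JSON on
    // every call, e.g. () => client.GetStreamAsync(objUri).Result
    public static void SplitJsonTwoDownloads(
        Func<Stream> openStream, string arrayPropertyName, string outputDirectory)
    {
        // First download: build the template in memory, skipping the huge array.
        JObject template = new JObject();
        using (JsonReader reader = new JsonTextReader(new StreamReader(openStream())))
        {
            if (!reader.Read() || reader.TokenType != JsonToken.StartObject)
                throw new InvalidDataException("Expected a JSON object at the root.");

            while (reader.Read() && reader.TokenType == JsonToken.PropertyName)
            {
                string propertyName = (string)reader.Value;
                reader.Read();  // move to the property's value
                if (propertyName == arrayPropertyName)
                {
                    reader.Skip();  // read past the array without materializing it
                    template[propertyName] = new JObject();  // placeholder keeps property order
                }
                else
                {
                    template[propertyName] = JToken.Load(reader);
                }
            }
        }

        // Second download: advance to the array and write one file per item.
        using (JsonReader reader = new JsonTextReader(new StreamReader(openStream())))
        {
            reader.Read();  // root StartObject (already validated in the first pass)
            while (reader.Read() && reader.TokenType == JsonToken.PropertyName)
            {
                string propertyName = (string)reader.Value;
                reader.Read();  // move to the property's value
                if (propertyName != arrayPropertyName)
                {
                    reader.Skip();  // no-op for primitives, skips children otherwise
                    continue;
                }

                int counter = 0;
                while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                {
                    if (reader.TokenType == JsonToken.StartObject)
                    {
                        counter++;
                        template[arrayPropertyName] = JObject.Load(reader);
                        string fileName = Path.Combine(outputDirectory,
                            string.Format("output_{0}_{1}_{2}.json",
                                          template["name"], template["age"], counter));
                        File.WriteAllText(fileName, template.ToString());
                    }
                }
                break;  // nothing else is needed after the array
            }
        }
    }
}
```

Only the template and one array item are held in memory at a time, and no temporary files are written. The trade-off is that the source has to be read twice.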

Brian Rogers answered Sep 21 '22 05:09
