
Load a large JSON file using multi-threading

Tags:

json

jq

I am trying to process a large 3 GB JSON file. Currently, with the jq utility it takes nearly 40 minutes to work through the entire file. I want to know how I can use a parallel/multi-threaded approach with jq to complete the process in less time. I am using v1.5.

Command Used:

JQ.exe -r -s "map(.\"results\" | map({\"ID\": (((.\"body\"?.\"party\"?.\"xrefs\"?.\"xref\"//[] | map(select(ID))[]?.\"id\"?))//null), \"Name\": (((.\"body\"?.\"party\"?.\"general-info\"?.\"full-name\"?))//null)} | [(.\"ID\"//\"\"|tostring), (.\"Name\"//\"\"|tostring)])) | add[] | join(\"~\")" "C:\InputFile.txt" >"C:\OutputFile.txt"

My data:

{
  "results": [
    {
      "_id": "0000001",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2011-02-14T08:21:51.000-05:00",
            "full-name": "Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades",
            "status": "ACTIVE",
            "last-update-user": "TS42922",
            "create-date": "2011-02-14T08:21:51.000-05:00",
            "classifications": {
              "classification": [
                {
                  "code": "PENS"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "LOCCU1",
                "id": "X00893X"
              },
              {
                "type": "ID",
                "id": "1012227139"
              }
            ]
          }
        }
      }
    },
    {
      "_id": "000002",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2015-05-21T15:10:45.174-04:00",
            "full-name": "Innova Capital Sp zoo",
            "status": "ACTIVE",
            "last-update-user": "jw74592",
            "create-date": "1994-08-31T00:00:00.000-04:00",
            "classifications": {
              "classification": [
                {
                  "code": "CORP"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "ULTDUN",
                "id": "144349875"
              },
              {
                "type": "AVID",
                "id": "6098743"
              },
              {
                "type": "LOCCU1",
                "id": "1001210218"
              },
              {
                "type": "ID",
                "id": "1001210218"
              },
              {
                "type": "BLMBRG",
                "id": "10009050"
              },
              {
                "type": "REG_CO",
                "id": "0000068508"
              },
              {
                "type": "SMCI",
                "id": "13159"
              }
            ]
          }
        }
      }
    }
  ]
}

Can someone please help me with which command I need to use in v1.5 in order to achieve parallelism/multi-threading?

Saikat Saha asked Oct 20 '22

1 Answer

Here is a streaming approach which assumes your 3GB data file is in data.json and the following filter is in filter1.jq:

  select(length==2)
| . as [$p, $v]
| {r:$p[1]}
| if   $p[2:6] == ["body","party","general-info","full-name"]       then .name = $v
  elif $p[2:6] == ["body","party","xrefs","xref"] and $p[7] == "id" then .id   = $v
  else  empty
  end      
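For context, `--stream` turns the input into a sequence of `[path, value]` events, plus length-1 closing events for containers, which is why the filter starts with `select(length==2)` and destructures each event into `$p` and `$v`. A tiny illustration with a toy document (not the real data):

```shell
# Toy input only; real paths will look like ["results",0,"body",...]
echo '{"results":[{"_id":"1"}]}' | jq -M -c --stream .
# [["results",0,"_id"],"1"]   <- leaf event: [path, value], length 2
# [["results",0,"_id"]]       <- closing events have length 1 and are
# [["results",0]]                dropped by select(length==2)
# [["results"]]
```

Because events arrive one at a time, jq never has to hold the whole 3 GB document in memory, which is where most of the 40 minutes goes with `-s`.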

When you run jq with

$ jq -M -c --stream -f filter1.jq data.json

jq will produce a stream of results containing just the details you need:

{"r":0,"name":"Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades"}
{"r":0,"id":"X00893X"}
{"r":0,"id":"1012227139"}
{"r":1,"name":"Innova Capital Sp zoo"}
{"r":1,"id":"144349875"}
{"r":1,"id":"6098743"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"10009050"}
{"r":1,"id":"0000068508"}
{"r":1,"id":"13159"}

which you can convert to your desired format by using a second filter2.jq:

foreach .[] as $i (
     {c: null, r:null, id:null, name:null}

   ; .c = $i
   | if .r != .c.r then .id=null | .name=null | .r=.c.r else . end   # control break
   | .id   = if .c.id == null   then .id   else .c.id   end
   | .name = if .c.name == null then .name else .c.name end

   ; [.id, .name]
   | if contains([null]) then empty else . end
   | join("~")
)

which consumes the output of the first filter when run with

$ jq -M -c --stream -f filter1.jq data.json | jq -M -s -r -f filter2.jq

and produces

X00893X~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
1012227139~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
144349875~Innova Capital Sp zoo
6098743~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
10009050~Innova Capital Sp zoo
0000068508~Innova Capital Sp zoo
13159~Innova Capital Sp zoo

This might be all you need, using just two jq processes. If you need more parallelism, you could use the record number (r) to partition the data and process the partitions in parallel. For example, if you save the intermediate output into a temp.json file

$ jq -M -c --stream -f filter1.jq data.json > temp.json

then you could process temp.json in parallel with filters such as

$ jq -M 'select(0==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result0.out &
$ jq -M 'select(1==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result1.out &
$ jq -M 'select(2==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result2.out &

and concatenate your partitions into a single result at the end if necessary. This example uses 3 partitions, but you could easily extend the approach to any number of partitions if you need more parallelism.
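The three commands above can be generalized to N partitions with a small shell loop. This is only a sketch assuming the temp.json and filter2.jq files from the previous steps; note that result order follows partition order, so re-sort afterwards if the original record ordering matters:

```shell
#!/bin/sh
# Partition temp.json by record number modulo N and process each
# partition with filter2.jq in its own background pipeline.
N=4
i=0
while [ "$i" -lt "$N" ]; do
  jq -M --argjson i "$i" --argjson n "$N" 'select(.r % $n == $i)' temp.json \
    | jq -M -s -r -f filter2.jq > "result$i.out" &
  i=$((i + 1))
done
wait                          # block until every background jq finishes
cat result?.out > result.out  # stitch the partitions back together
```

Keeping whole records together (all lines sharing one `r` value land in the same partition) is what lets filter2.jq's control-break logic still work.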

GNU parallel is also a good option. As mentioned in the jq Cookbook, jq-hopkok's parallelism folder has some good examples.

jq170727 answered Dec 05 '22