I am trying to load a large (3 GB) JSON file. Currently, jq takes nearly 40 minutes to process the entire file. I would like to know how to use parallelism or multithreading with jq so that the job finishes in less time. I am using jq 1.5.
Command Used:
JQ.exe -r -s "map(.\"results\" | map({\"ID\": (((.\"body\"?.\"party\"?.\"xrefs\"?.\"xref\"//[] | map(select(ID))[]?.\"id\"?))//null), \"Name\": (((.\"body\"?.\"party\"?.\"general-info\"?.\"full-name\"?))//null)} | [(.\"ID\"//\"\"|tostring), (.\"Name\"//\"\"|tostring)])) | add[] | join(\"~\")" "\C:\InputFile.txt" >"\C:\OutputFile.txt"
My data:
{
"results": [
{
"_id": "0000001",
"body": {
"party": {
"related-parties": {},
"general-info": {
"last-update-ts": "2011-02-14T08:21:51.000-05:00",
"full-name": "Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades",
"status": "ACTIVE",
"last-update-user": "TS42922",
"create-date": "2011-02-14T08:21:51.000-05:00",
"classifications": {
"classification": [
{
"code": "PENS"
}
]
}
},
"xrefs": {
"xref": [
{
"type": "LOCCU1",
"id": "X00893X"
},
{
"type": "ID",
"id": "1012227139"
}
]
}
}
}
},
{
"_id": "000002",
"body": {
"party": {
"related-parties": {},
"general-info": {
"last-update-ts": "2015-05-21T15:10:45.174-04:00",
"full-name": "Innova Capital Sp zoo",
"status": "ACTIVE",
"last-update-user": "jw74592",
"create-date": "1994-08-31T00:00:00.000-04:00",
"classifications": {
"classification": [
{
"code": "CORP"
}
]
}
},
"xrefs": {
"xref": [
{
"type": "ULTDUN",
"id": "144349875"
},
{
"type": "AVID",
"id": "6098743"
},
{
"type": "LOCCU1",
"id": "1001210218"
},
{
"type": "ID",
"id": "1001210218"
},
{
"type": "BLMBRG",
"id": "10009050"
},
{
"type": "REG_CO",
"id": "0000068508"
},
{
"type": "SMCI",
"id": "13159"
}
]
}
}
}
}
]
}
Can someone please tell me which command I need to use in v1.5 to achieve parallelism/multithreading?
Here is a streaming approach which assumes your 3 GB data file is in data.json and the following filter is in filter1.jq:
select(length==2)
| . as [$p, $v]
| {r:$p[1]}
| if $p[2:6] == ["body","party","general-info","full-name"] then .name = $v
elif $p[2:6] == ["body","party","xrefs","xref"] and $p[7] == "id" then .id = $v
else empty
end
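To see what the filter above is matching against, it can help to look at the raw events that --stream emits. The following is a minimal sketch on a made-up one-record sample (sample.json and the helper leaf_events are hypothetical names, not part of the original answer). Each leaf arrives as a [path, value] pair, the second path element is the index of the record within "results", and closing events have length 1, which is why select(length==2) discards them.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical one-record stand-in for data.json, with the same nesting
# as the data in the question.
cat > sample.json <<'EOF'
{"results":[{"body":{"party":{"general-info":{"full-name":"A"},"xrefs":{"xref":[{"type":"ID","id":"1"}]}}}}]}
EOF

# Print only the [path, value] leaf events; closing events (length 1) are dropped.
leaf_events() {
  jq -c --stream 'select(length==2)' "$1"
}

leaf_events sample.json
# [["results",0,"body","party","general-info","full-name"],"A"]
# [["results",0,"body","party","xrefs","xref",0,"type"],"ID"]
# [["results",0,"body","party","xrefs","xref",0,"id"],"1"]
```

In these paths, $p[1] is the record number, $p[2:6] is the four-element slice the filter compares, and $p[7] is the key inside each xref entry.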
When you run jq with
$ jq -M -c --stream -f filter1.jq data.json
jq will produce a stream of results containing just the details you need
{"r":0,"name":"Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades"}
{"r":0,"id":"X00893X"}
{"r":0,"id":"1012227139"}
{"r":1,"name":"Innova Capital Sp zoo"}
{"r":1,"id":"144349875"}
{"r":1,"id":"6098743"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"10009050"}
{"r":1,"id":"0000068508"}
{"r":1,"id":"13159"}
which you can convert to your desired format by using a second filter, filter2.jq:
foreach .[] as $i (
{c: null, r:null, id:null, name:null}
; .c = $i
| if .r != .c.r then .id=null | .name=null | .r=.c.r else . end # control break
| .id = if .c.id == null then .id else .c.id end
| .name = if .c.name == null then .name else .c.name end
; [.id, .name]
| if contains([null]) then empty else . end
| join("~")
)
which consumes the output of the first filter when run with
$ jq -M -c --stream -f filter1.jq data.json | jq -M -s -r -f filter2.jq
and produces
X00893X~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
1012227139~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
144349875~Innova Capital Sp zoo
6098743~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
10009050~Innova Capital Sp zoo
0000068508~Innova Capital Sp zoo
13159~Innova Capital Sp zoo
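A possible variation on the second stage: with -s, jq reads the whole intermediate stream into one array before the filter runs. Since the control-break logic only needs the current record and the running state, the same foreach can read records one at a time with inputs and the -n option (both available in jq 1.5). The sketch below is an assumption-laden illustration, not the original answer's method: filter2_inputs.jq is a hypothetical file name and the five intermediate records are invented.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Invented intermediate records in the shape produced by filter1.jq.
cat > temp.json <<'EOF'
{"r":0,"name":"A"}
{"r":0,"id":"1"}
{"r":0,"id":"2"}
{"r":1,"name":"B"}
{"r":1,"id":"3"}
EOF

# Same logic as filter2.jq, but iterating over `inputs` instead of `.[]`,
# so it must be run with -n rather than -s.
cat > filter2_inputs.jq <<'EOF'
foreach inputs as $i (
    {c: null, r: null, id: null, name: null}
  ; .c = $i
  | if .r != .c.r then .id = null | .name = null | .r = .c.r else . end  # control break
  | .id   = if .c.id   == null then .id   else .c.id   end
  | .name = if .c.name == null then .name else .c.name end
  ; [.id, .name]
  | if contains([null]) then empty else . end
  | join("~")
)
EOF

jq -n -r -f filter2_inputs.jq temp.json > result.out
cat result.out
```

On the full data this would be used as the second half of the same pipeline, for example jq -M -c --stream -f filter1.jq data.json | jq -n -r -f filter2_inputs.jq. Whether it is noticeably faster than the -s version on your data is something to measure rather than assume.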
This might be all you need using just two jq processes. If you need more parallelism, you could use the record number (r) to partition the data and process the partitions in parallel. For example, if you save the intermediate output into a temp.json file
$ jq -M -c --stream -f filter1.jq data.json > temp.json
then you could process temp.json in parallel with filters such as
$ jq -M 'select(0==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result0.out &
$ jq -M 'select(1==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result1.out &
$ jq -M 'select(2==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result2.out &
and concatenate your partitions into a single result at the end if necessary. This example uses 3 partitions but you could easily extend this approach to any number of partitions if you need more parallelism.
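The partitioning above can be generalized to N partitions with a small shell loop. This is a sketch under a few assumptions: a POSIX sh, jq 1.5 or later (for --argjson), and a made-up miniature temp.json so the script runs on its own. The variable N and the output name result.out are illustrative choices.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Invented miniature temp.json in the shape produced by filter1.jq.
cat > temp.json <<'EOF'
{"r":0,"name":"A"}
{"r":0,"id":"1"}
{"r":1,"name":"B"}
{"r":1,"id":"2"}
{"r":2,"name":"C"}
{"r":2,"id":"3"}
{"r":3,"name":"D"}
{"r":3,"id":"4"}
EOF

# filter2.jq as given in the answer.
cat > filter2.jq <<'EOF'
foreach .[] as $i (
    {c: null, r: null, id: null, name: null}
  ; .c = $i
  | if .r != .c.r then .id = null | .name = null | .r = .c.r else . end
  | .id   = if .c.id   == null then .id   else .c.id   end
  | .name = if .c.name == null then .name else .c.name end
  ; [.id, .name]
  | if contains([null]) then empty else . end
  | join("~")
)
EOF

N=3   # number of partitions (assumption: tune to the number of available cores)

k=0
while [ "$k" -lt "$N" ]; do
  # Partition k keeps the records whose record number satisfies r % N == k.
  jq -c --argjson n "$N" --argjson k "$k" 'select(.r % $n == $k)' temp.json \
    | jq -s -r -f filter2.jq > "result$k.out" &
  k=$((k + 1))
done

wait   # block until every background pipeline has finished

# Concatenate the partitions in partition order.
: > result.out
k=0
while [ "$k" -lt "$N" ]; do
  cat "result$k.out" >> result.out
  k=$((k + 1))
done
cat result.out
```

Note that the concatenated file is grouped by partition (records 0 and 3 first, then 1, then 2 in this sample), so it will not be in the original record order. Every partition also rereads all of temp.json, so the gain depends on how much of the time is spent in the second stage.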
GNU parallel is also a good option. As mentioned in the jq Cookbook, the parallelism folder of jq-hopkok has some good examples.