I get a very large JSON stream (several GB) from curl and try to process it with jq. The relevant output I want to parse with jq is packed in a document representing the result structure:
{
  "results": [
    {
      "columns": ["n"],
      // get this
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
        // ... millions of rows
      ]
    }
  ],
  "errors": []
}
I want to extract the row data with jq. This is simple:
curl XYZ | jq -r -c '.results[0].data[].row[]'
Result:
{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}
However, this always waits until curl is completed. I played with the --stream option, which is made for dealing with this. I tried the following command, but it also waits until the full object is returned from curl:
curl XYZ | jq -n --stream 'fromstream(1|truncate_stream(inputs)) | .[].data[].row[]'
Is there a way to 'jump' to the data field and start parsing the rows one by one, without waiting for the document's closing brackets?
You can use jq! For really large files, have a look at the --stream option. But there is no way to do random access into a large JSON file without first building some kind of semi-index.
Concatenated JSON streaming allows the sender to simply write each JSON object into the stream with no delimiters. It relies on the receiver using a parser that can recognize and emit each JSON object as the terminating character is parsed.
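jq itself accepts such concatenated input: its regular parser emits each top-level value as soon as that value is complete. A minimal illustration (the values are made up for the demo):

printf '{"a": 1}{"a": 2}' | jq -c .
# {"a":1}
# {"a":2}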
jq is a free open source JSON processor that is flexible and straightforward to use. It allows users to display a JSON file using standard formatting, or to retrieve certain records or attribute-value pairs from it.
A jq program is a "filter": it takes an input, and produces an output. There are a lot of builtin filters for extracting a particular field of an object, or converting a number to a string, or various other standard tasks.
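For instance, the two standard tasks just mentioned look like this (the sample input is invented for illustration):

echo '{"n": 42}' | jq '.n'             # extract a field  -> 42
echo '{"n": 42}' | jq '.n | tostring'  # number to string -> "42"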
(1) The vanilla filter you would use would be as follows:
jq -r -c '.results[0].data[].row'
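Against the sample document above, this prints one row array per data element:

[{"key1":"row1","key2":"row1"}]
[{"key1":"row2","key2":"row2"}]

Append [] (i.e. .row[]) if you want the bare objects instead.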
(2) One way to use the streaming parser here would be to use it to process the output of .results[0].data, but the combination of the two steps will probably be slower than the vanilla approach.
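A sketch of that two-step pipeline (with XYZ standing in for the real URL, as in the question): the first jq isolates the data array, and the second walks it with the streaming parser so that the rows are emitted one by one:

curl XYZ |
  jq '.results[0].data' |
  jq -cn --stream 'fromstream(1|truncate_stream(inputs)) | .row[]'

The first jq still has to read the entire document before it can print anything, which is why this is unlikely to beat approach (1).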
(3) To produce the output you want, you could run:
jq -nc --stream '
  fromstream(inputs
    | select( [.[0][0,2,4]] == ["results", "data", "row"])
    | del(.[0][0:5]) )
  | .[]    # unwrap the enclosing row array'
(4) Alternatively, you may wish to try something along these lines:
jq -nc --stream 'inputs
| select(length==2)
| select( [.[0][0,2,4]] == ["results", "data", "row"])
| [ .[0][6], .[1]] '
For the illustrative input, the output from the last invocation would be:
["key1","row1"]
["key2","row1"]
["key1","row2"]
["key2","row2"]
To get:
{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}
From:
{
  "results": [
    {
      "columns": ["n"],
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      ]
    }
  ],
  "errors": []
}
Do the following, which is equivalent to jq -c '.results[].data[].row[]', but using streaming:
jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row") | del(.[0][0:5])))'
What this does is:
- read the input as a raw stream of [path, value] events (--stream),
- keep only the events lying on paths of the form .results[].data[].row[], as illustrated below (with select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row")),
- delete the leading "results",0,"data",0,"row" portion of each path (with del(.[0][0:5])),
- and reassemble the row objects using the fromstream(1|truncate_stream(…)) pattern from the jq FAQ.
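To see why the path indices 0, 2 and 4 are the ones tested, it helps to look at the raw events --stream produces. For a stripped-down document (cut to a single row with no meta for brevity):

echo '{"results":[{"data":[{"row":[{"key1":"row1"}]}]}]}' | jq -cn --stream 'inputs'
# [["results",0,"data",0,"row",0,"key1"],"row1"]
# [["results",0,"data",0,"row",0,"key1"]]
# [["results",0,"data",0,"row",0]]
# [["results",0,"data",0,"row"]]
# [["results",0,"data",0]]
# [["results",0,"data"]]
# [["results",0]]
# [["results"]]

Only the first four of these survive the select; del(.[0][0:5]) then strips the shared path prefix before fromstream reassembles the values.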
For example:
echo '
{
  "results": [
    {
      "columns": ["n"],
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      ]
    }
  ],
  "errors": []
}
' | jq -cn --stream '
fromstream(1|truncate_stream(
inputs | select(
.[0][0] == "results" and
.[0][2] == "data" and
.[0][4] == "row"
) | del(.[0][0:5])
))'
Produces the desired output:
{"key1":"row1","key2":"row1"}
{"key1":"row2","key2":"row2"}