
Process large JSON stream with jq

Tags: json, jq

I get a very large JSON stream (several GB) from curl and try to process it with jq.

The relevant output I want to parse with jq is packed in a document representing the result structure:

{
  "results":[
    {
      "columns": ["n"],

      // get this
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      //  ... millions of rows      

      ]
    }
  ],
  "errors": []
}

I want to extract the row data with jq. This is simple:

curl XYZ | jq -r -c '.results[0].data[].row[]'

Result:

{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}

However, this always waits until curl has finished.

I played with the --stream option, which is made for dealing with this. I tried the following command, but it also waits until the full object is returned from curl:

curl XYZ | jq -n --stream 'fromstream(1|truncate_stream(inputs)) | .[].data[].row[]'

Is there a way to 'jump' to the data field and start parsing rows one by one, without waiting for closing tags?

Martin Preusse asked Aug 30 '16

People also ask

Can jq handle large files?

You can use jq! For really large files, have a look at the --stream option. There is no way to do random access on a large JSON file, though, without first building some kind of semi-index.
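For instance, if a large file's top level is an array, the fromstream(1|truncate_stream(…)) pattern from the jq FAQ emits each element as soon as it has been parsed instead of loading the whole document (huge.json is a hypothetical file name):

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' huge.json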

Can you stream JSON?

Concatenated JSON streaming allows the sender to simply write each JSON object into the stream with no delimiters. It relies on the receiver using a parser that can recognize and emit each JSON object as the terminating character is parsed.
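jq itself can consume concatenated JSON: give it several values back to back and it runs the filter on each value as it is parsed. A minimal illustration:

printf '{"a":1}{"b":2}' | jq -c .

which prints {"a":1} and {"b":2} on separate lines.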

What is jq format?

jq is a free open source JSON processor that is flexible and straightforward to use. It allows users to display a JSON file using standard formatting, or to retrieve certain records or attribute-value pairs from it.
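For example, with the document from the question saved as data.json (a hypothetical file name), the first command pretty-prints the whole file and the second retrieves a single attribute:

jq '.' data.json
jq '.results[0].columns' data.json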

What is jq query?

A jq program is a "filter": it takes an input, and produces an output. There are a lot of builtin filters for extracting a particular field of an object, or converting a number to a string, or various other standard tasks.
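A minimal illustration of such a filter pipeline, extracting a field and converting the number to a string (the field name is made up for the example):

echo '{"count": 42}' | jq '.count | tostring'

This prints "42".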


2 Answers

(1) The vanilla filter you would use would be as follows:

jq -r -c '.results[0].data[].row'

(2) One way to use the streaming parser here would be to use it to process the output of .results[0].data, but the combination of the two steps will probably be slower than the vanilla approach.
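As a rough, untested sketch of that two-step combination (XYZ stands for the real URL, as in the question): the first jq streams out each element of .results[0].data as soon as it is complete (the depth of 4 strips "results", the results index, "data" and the data index from each path), and the second jq then picks out the rows:

curl XYZ | jq -cn --stream 'fromstream(4|truncate_stream(inputs))' | jq -c '.row[]'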

(3) To produce the output you want, you could run:

jq -nc --stream '
  fromstream(inputs
    | select( [.[0][0,2,4]] == ["results", "data", "row"])
    | del(.[0][0:5]) )'

(4) Alternatively, you may wish to try something along these lines:

jq -nc --stream 'inputs
      | select(length==2)
      | select( [.[0][0,2,4]] == ["results", "data", "row"])
      | [ .[0][6], .[1]] '

For the illustrative input, the output from the last invocation would be:

["key1","row1"] ["key2","row1"] ["key1","row2"] ["key2","row2"]

peak answered Sep 17 '22


To get:

{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}

From:

{
  "results":[
    {
      "columns": ["n"],
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      ]
    }
  ],
  "errors": []
}

Do the following, which is equivalent to jq -c '.results[].data[].row[]', but using streaming:

jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row") | del(.[0][0:5])))'

What this does is:

  • Turn the JSON into a stream (with --stream)
  • Select the path .results[].data[].row[] (with select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row"))
  • Discard those initial parts of the path, like "results",0,"data",0,"row" (with del(.[0][0:5]))
  • And finally turn the resulting jq stream back into the expected JSON with the fromstream(1|truncate_stream(…)) pattern from the jq FAQ

For example:

echo '
  {
    "results":[
      {
        "columns": ["n"],
        "data": [    
          {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
          {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
        ]
      }
    ],
    "errors": []
  }
' | jq -cn --stream '
  fromstream(1|truncate_stream(
    inputs | select(
      .[0][0] == "results" and 
      .[0][2] == "data" and 
      .[0][4] == "row"
    ) | del(.[0][0:5])
  ))'

Produces the desired output.
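If the endpoint streams its response, you can combine this with curl's -N/--no-buffer option so that rows are printed as they arrive rather than after the download finishes (XYZ again stands for the real URL):

curl -sN XYZ | jq -cn --stream '
  fromstream(1|truncate_stream(
    inputs | select(
      .[0][0] == "results" and
      .[0][2] == "data" and
      .[0][4] == "row"
    ) | del(.[0][0:5])
  ))'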

James McKinney answered Sep 17 '22