I have a huge (~7GB) JSON array of relatively small objects.
Is there a relatively simple way to filter these objects without loading the whole file into memory?
The --stream option looks suitable, but I can't figure out how to fold the stream of [path, value] pairs back into the original objects.
You can use jq! For really large files, have a look at the --stream option. Note, however, that there is no way to do random access on a large JSON file without building a semi-index for it.
jq is almost always the bottleneck in the pipeline, pegged at 100% CPU - so much so that we often add an fgrep to the left side of the pipeline to cut down the input jq has to parse. It's very slow.
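A minimal sketch of that pre-filtering trick, assuming newline-delimited input in a hypothetical events.ndjson whose records carry a "status" field (fgrep only helps when each record sits on its own line and the literal string actually appears in the records you want):

$ fgrep '"status":"error"' events.ndjson | jq -c 'select(.status == "error")'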
On the .NET stack, Json.NET is a great tool for parsing large files. It's fast, efficient, and it's the most downloaded NuGet package out there.
jq is a command-line utility that can slice, filter, and transform the components of a JSON file.
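For instance, a one-liner that keeps only active records and projects two fields (the file and field names here are made up for illustration):

$ jq '.[] | select(.active) | {id, name}' data.json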
jq 1.5 has a streaming parser. The jq FAQ gives an example of how to convert a top-level array of JSON objects into a stream of its elements (the first JSON line below is the input supplied on stdin; the remaining lines are the output):
$ jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
[{"foo":"bar"},{"foo":"baz"}]
{"foo":"bar"}
{"foo":"baz"}
That may be enough for your purposes, but it is worth noting that setpath/2 can also be helpful. Here's how to produce a stream of "leaflets", i.e. one minimal object per [path, value] pair:
jq -c --stream '. as $in | select(length == 2) | {}|setpath($in[0]; $in[1])'
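For illustration, with a small top-level object as input this yields one leaflet per leaf value (for a top-level array, whose paths begin with a numeric index, you may need to start from null rather than {}):

$ echo '{"a":{"b":1,"c":2}}' | jq -c --stream '. as $in | select(length == 2) | {}|setpath($in[0]; $in[1])'
{"a":{"b":1}}
{"a":{"c":2}}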
Further information and documentation are available in the jq manual: https://stedolan.github.io/jq/manual/#Streaming