I need to split large files (~5 GB) of JSON data into smaller files of newline-delimited JSON in a memory-efficient way (i.e., without having to read the entire JSON blob into memory). The JSON data in each source file is an array of objects.
Unfortunately, the source data is not newline-delimited JSON and in some cases there are no newlines in the files at all. This means I can't simply use the split
command to split the large file into smaller chunks by newline. Here are examples of how the source data is stored in each file:
Example of a source file with newlines.
[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
...
,{"id": 9, "name": "qux"}]
Example of a source file without newlines.
[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}, ...{"id": 9, "name": "qux"}]
Here's an example of the desired format for a single output file:
{"id": 1, "name": "foo"}
{"id": 2, "name": "bar"}
{"id": 3, "name": "baz"}
I'm able to achieve the desired result by using jq and split as described in this SO post. This approach is memory efficient thanks to the jq streaming parser. Here's the command that achieves the desired result:
cat large_source_file.json \
| jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
| split --line-bytes=1m --numeric-suffixes - split_output_file
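For illustration, running the same filter against a tiny inline array should produce one compact object per line:
$ echo '[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]' \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
{"id":1,"name":"foo"}
{"id":2,"name":"bar"}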
Run against the full source file, the command above takes ~47 minutes. This seems quite slow, especially when compared to sed, which can produce the same output much faster. Here are some performance benchmarks showing processing time with jq vs. sed.
export SOURCE_FILE=medium_source_file.json # smaller 250MB
# using jq
time cat ${SOURCE_FILE} \
| jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
| split --line-bytes=1m - split_output_file
real 2m0.656s
user 1m58.265s
sys 0m6.126s
# using sed
time cat ${SOURCE_FILE} \
| sed -E 's#^\[##g' \
| sed -E 's#^,\{#\{#g' \
| sed -E 's#\]$##g' \
| sed 's#},{#}\n{#g' \
| split --line-bytes=1m - sed_split_output_file
real 0m25.545s
user 0m5.372s
sys 0m9.072s
Why is jq so much slower compared to sed? It makes sense jq would be slower given it's doing a lot of validation under the hood, but 4X slower doesn't seem right. Is there anything I can do to improve the speed at which jq can process this file? I'd prefer to use jq to process files because I'm confident it could seamlessly handle other line output formats, but given I'm processing thousands of files each day, it's hard to justify the speed difference I've observed.
jq's streaming parser (the one invoked with the --stream command-line option) intentionally sacrifices speed for the sake of reduced memory requirements, as illustrated below in the metrics section. A tool which strikes a different balance (one which seems to be closer to what you're looking for) is jstream, the homepage of which is https://github.com/bcicen/jstream
Running the sequence of commands in a bash or bash-like shell:
cd
go get github.com/bcicen/jstream
cd go/src/github.com/bcicen/jstream/cmd/jstream/
go build
will result in an executable, which you can invoke like so:
jstream -d 1 < INPUTFILE > STREAM
Assuming INPUTFILE contains a (possibly ginormous) JSON array, the above will behave like jq's .[], with jq's -c (compact) command-line option. In fact, this is also the case if INPUTFILE contains a stream of JSON arrays, or a stream of JSON non-scalars ...
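For the original splitting task, a minimal sketch (assuming jstream is on your PATH and reusing the filenames from the question) would be to swap jstream in for the jq stage of the pipeline:
# sketch: jstream in place of the jq stage (filenames reused from the question)
jstream -d 1 < large_source_file.json \
  | split --line-bytes=1m --numeric-suffixes - split_output_file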
For the task at hand (streaming the top-level items of an array):
               mrss (max RSS)   u+s (user+sys, seconds)
jq --stream:        2 MB            447
jstream:            8 MB            114
jq:             5,582 MB             39
In words:
space: jstream is economical with memory, but not as economical as jq's streaming parser.
time: jstream runs slower than jq's regular parser (about 3x here), but about 4 times faster than jq's streaming parser.
Interestingly, space*time is about the same for the two streaming parsers.
The test file consists of an array of 10,000,000 simple objects:
[
{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
,{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
...
]
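The answer doesn't say how input.json was produced; one way to generate a comparable test file (an assumption, purely for reproducing the layout above) is:
# Assumption: one possible way to synthesize a test file like the one described above
awk 'BEGIN {
  obj = "{\"key_one\": 0.13888342355537053, \"key_two\": 0.4258700286271502, \"key_three\": 0.8010012924267487}"
  print "["
  print obj
  for (i = 2; i <= 10000000; i++) print "," obj
  print "]"
}' > input.json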
$ ls -l input.json
-rw-r--r-- 1 xyzzy staff 980000002 May 2 2019 input.json
$ wc -l input.json
10000001 input.json
$ /usr/bin/time -l jq empty input.json
43.91 real 37.36 user 4.74 sys
4981452800 maximum resident set size
$ /usr/bin/time -l jq length input.json
10000000
48.78 real 41.78 user 4.41 sys
4730941440 maximum resident set size
$ /usr/bin/time -l jq type input.json
"array"
37.69 real 34.26 user 3.05 sys
5582196736 maximum resident set size
$ /usr/bin/time -l jq 'def count(s): reduce s as $i (0;.+1); count(.[])' input.json
10000000
39.40 real 35.95 user 3.01 sys
5582176256 maximum resident set size
$ /usr/bin/time -l jq -cn --stream 'fromstream(1|truncate_stream(inputs))' input.json | wc -l
449.88 real 444.43 user 2.12 sys
2023424 maximum resident set size
10000000
$ /usr/bin/time -l jstream -d 1 < input.json > /dev/null
61.63 real 79.52 user 16.43 sys
7999488 maximum resident set size
$ /usr/bin/time -l jstream -d 1 < input.json | wc -l
77.65 real 93.69 user 20.85 sys
7847936 maximum resident set size
10000000