Requirement: I have a JSON file in .gz format. Compressed, it is around ~500 MB; extracted, the JSON file is nearly ~10 GB. The extracted file contains individual JSON objects, one per line. I want to sort the file based on a field ps, using either a bash script or a Python program.
Because the file is too large, it's not advisable to load it into memory. So I used the gzcat and cat commands to stream the JSON data and pipe it to jq for sorting. But either the system stops responding during the process, or I get an empty output.json.
>cat sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
>gzcat sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
Hardware: 16GB RAM, core i5 processor
Sample JSON data:
{
"ps":"abc"
....
}
{
"ps":"def"
......
}
{
"ps":"abc"
....
}
Expected output:
{
"ps":"abc"
....
}
{
"ps":"abc"
....
}
{
"ps":"def"
....
}
I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file?
Links I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism
Also, is there any way I can do this via map-reduce without Hadoop?
Approach 1: Streaming data to a local SQLite DB.
import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor obj

record_count = 0
for line in fileinput.input():
    cur.execute(insert_query, (line,))
    record_count += 1

con.commit()  # commit, or the inserted rows are lost when the script exits
Command line:
>gzcat sth.json.gz | python insert.py
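Insertion alone only loads the data; the sort itself can then be pushed down to SQLite, which will spill to disk rather than RAM. Here is a minimal sketch of the read-back step, assuming the SQLite build includes the JSON1 functions (json_extract); the in-memory database and sample records are illustrative stand-ins for testDB.db:

```python
import json
import sqlite3

# Hypothetical sample records; the real data is streamed in line by line.
lines = [
    '{"ps":"def","x":1}',
    '{"ps":"abc","x":0}',
    '{"ps":"abc","x":2}',
]

con = sqlite3.connect(":memory:")  # use the file path for the real 10 GB case
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS feeds (data TEXT)")
cur.executemany("INSERT INTO feeds (data) VALUES (?)", [(l,) for l in lines])
con.commit()

# Let SQLite do the sorting; json_extract pulls the .ps field out of each row.
rows = cur.execute(
    "SELECT data FROM feeds ORDER BY json_extract(data, '$.ps')"
).fetchall()
for (line,) in rows:
    print(line)
```

On a file-backed database you would run the same SELECT after the insert script finishes, redirecting the printed lines to output.json.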
Here is one solution based on the suggestion in one of the comments:
If you can e.g. prefix the lines with the sort key so that they can be sorted as text rather than JSON, then GNU sort can easily sort 10GB+ files without loading them into memory. – that other guy
You can use jq to do this along the following lines:
jq -cr '"\(.ps)\t\(.)"'
This will produce lines with tab-separated values like so:
abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}
Using the -c option ensures that each pair (i.e. the sorting key and object) is written to a single line.
Now you can easily sort the lines, e.g. using sort, and then use e.g. cut to strip the .ps prefix. Finally, if you really want the output to be formatted, you can again use jq (e.g. jq .), the point being that jq is by default stream-oriented.
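Put together, this is the classic decorate-sort-undecorate pattern. A pure-Python sketch of the same idea, with a hypothetical two-record input (on the real 10 GB file, GNU sort performs the sorting step on disk rather than in memory):

```python
import json

# Stand-in for the streamed input lines.
records = [
    '{"ps":"def","x":1}',
    '{"ps":"abc","x":0}',
]

# Decorate: prefix each line with its sort key and a tab,
# like jq -cr '"\(.ps)\t\(.)"'.
keyed = ["%s\t%s" % (json.loads(line)["ps"], line) for line in records]

# Sort as plain text, the step GNU sort would do externally.
keyed.sort()

# Undecorate: strip the key again, like `cut -f2-`.
output = [line.split("\t", 1)[1] for line in keyed]
```

The decorated lines sort correctly as text because the key comes first, so no JSON parsing is needed during the sort itself.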
The above assumes that the .ps values are tab-free. If that is not the case, then you could either use a different field-separator, or:
jq -cr '([.ps] | @tsv) + "\t" + tostring'
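To see why the @tsv variant matters, here is a small Python sketch with a hypothetical tab-containing key: jq's @tsv escapes embedded tabs (and backslashes) inside the key, so the first literal tab on each line still cleanly separates the key from the record.

```python
import json

record = {"ps": "a\tb", "x": 0}
line = json.dumps(record, separators=(",", ":"))

# Naive decoration: the tab embedded in .ps corrupts the key/value split.
naive = "%s\t%s" % (record["ps"], line)
assert naive.split("\t", 1)[0] == "a"  # wrong: key truncated at the embedded tab

# @tsv-style decoration escapes backslashes and tabs inside the key,
# so the first real tab still separates key from record.
escaped_key = record["ps"].replace("\\", "\\\\").replace("\t", "\\t")
safe = "%s\t%s" % (escaped_key, line)
assert safe.split("\t", 1)[1] == line  # record recovered intact
```

The same recovery step (`cut -f2-` or split on the first tab) then works regardless of the key's contents.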