 

Sort huge JSON file using bash or python

Requirement: I have a JSON file in .gz format. Compressed, it is around ~500 MB; when I extract it, the JSON file is nearly ~10 GB. The extracted file contains individual JSON objects line by line. What I want is to sort the file based on a field ps, using either a bash script or a Python program.

Because the file is too large, it's not advisable to load it into memory. So I used the gzcat and cat bash commands to stream the JSON data and then piped it to jq for sorting. But either the system doesn't respond during the process, or I get an empty output.json file.

>cat  sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"
>gzcat  sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"

Hardware: 16GB RAM, core i5 processor

Sample JSON data:

{
    "ps":"abc"
    ....
}
{   
    "ps":"def"
    ......
}
{
    "ps":"abc"
    ....
}

Expected output:

{
    "ps":"abc"
    ....
}
{   
    "ps":"abc"
    ....
}
{
    "ps":"def"
    ....
}

I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file? Links I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism

Also, is there any way I can do this via MapReduce without Hadoop?

Approach 1: Streaming data into a local SQLite DB.

import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor obj

record_count = 0
for line in fileinput.input():  # read the JSON lines streamed via stdin
    cur.execute(insert_query, (line,))
    record_count += 1

con.commit()  # persist the inserted rows
con.close()

command line:

>gzcat sth.json.gz | python insert.py
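Once the rows are loaded (and committed), one possible way to complete this approach is to let SQLite do the sorting on the way out. This is only a sketch: it assumes the sqlite3 command-line shell is available and that the SQLite build includes the JSON1 extension (which provides json_extract); the feeds table and data column come from the script above:

sqlite3 ".../sqlite-snapshot-201904101324/testDB.db" <<'SQL' > output.json
SELECT data FROM feeds ORDER BY json_extract(data, '$.ps');
SQL

SQLite spills large ORDER BY operations to temporary storage, so the 10 GB of rows does not have to fit in memory.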
asked Apr 16 '19 by dks551


1 Answer

Here is one solution based on the suggestion in one of the comments:

If you can e.g. prefix the lines with the sort key so that they can be sorted as text rather than JSON, then GNU sort can easily sort 10GB+ files without loading them into memory. – that other guy

You can use jq to do this along the following lines:

jq -cr '"\(.ps)\t\(.)"' 

This will produce lines with tab-separated values like so:

abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}

Using the -c option ensures that each pair (i.e. the sorting key and object) is written to a single line.

Now you can easily sort the lines, e.g. using sort, and then use cut to strip the leading .ps key.
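Putting the pieces together, here is a minimal sketch of the whole pipeline, reusing the sth2.json.gz and output.json names from the question; the sort options are assumptions chosen so that sort splits on tabs and orders by the first field only, and GNU sort spills to temporary files rather than loading the 10 GB into RAM:

gzcat sth2.json.gz \
  | jq -cr '"\(.ps)\t\(.)"' \
  | sort -t $'\t' -k1,1 \
  | cut -f2- > output.json

If the collation of non-ASCII keys doesn't matter, running the sort step under LC_ALL=C is usually noticeably faster, and -S can give sort a larger in-memory buffer.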

Finally, if you really want the output to be formatted, you can again use jq (e.g. jq .), the point being that jq is stream-oriented by default.
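For example, assuming the sorted, key-stripped objects were written to output.json as in the sketch above, pretty-printing the whole stream is simply:

jq . output.json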

Caveat

The above assumes that the .ps values are tab-free. If that is not the case, then you could either use a different field-separator, or:

jq -cr '([.ps] | @tsv) + "\t" + tostring'
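To illustrate with a hypothetical object whose ps value contains a tab (not taken from the question's data):

printf '%s\n' '{"ps":"a\tb","x":0}' | jq -cr '([.ps] | @tsv) + "\t" + tostring'

Here @tsv escapes the embedded tab as the two characters \t, so the emitted key is a\tb and the only real tab on the line is the separator; sort and cut therefore still see exactly one key/object pair, and cut -f2- recovers the original object unchanged.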
answered Oct 12 '22 by peak