I prefer to get things done with shell commands. I have a very large file -- about 2.8 GB -- whose content is JSON. Everything is on one line, and I was told it contains at least 1.5 million records.
I must prepare the file for consumption. Each record must be on its own line. Sample:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}
Or, using the following fuller sample...
{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}
Final outcome should be:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Attempted commands:
sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat
The attempted commands work perfectly fine for small files, but they do not work for the 2.8 GB file I must manipulate. sed quit midway after 10 minutes with no indication of why, having written nothing. awk errored out with a segmentation fault (core dump) after many hours. I also tried perl's search and replace and got an "Out of memory" error.
Any help/ideas would be great!
Since you've tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's kind of off-topic, I believe that jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.
Assuming content like the following (based on your sample):
{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}
You can easily reformat this to "prettify" it:
$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}
And we can dig into the data to retrieve only the records you're interested in (regardless of what they're wrapped in):
$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}
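If you'd rather name the path explicitly instead of using the generic .[][][], this equivalent filter (assuming the wrapper keys from your sample) produces the same output:
$ jq '.RomanCharacters.Alphabet[]' < data.json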
This is much more readable, both by humans and by tools like awk, which process content line by line. If you want to join your lines for processing per your question, the awk becomes much simpler:
$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{ "RecordId": "1", "data": "data" }
{ "RecordId": "2", "data": "data" }
{ "RecordId": "3", "data": "data" }
{ "RecordId": "4", "data": "data" }
{ "RecordId": "5", "data": "data" }
Or, as @peak suggested in comments, eliminate the awk portion entirely by using jq's -c (compact output) option:
$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}
Regarding perl: try setting the input record separator $/ to "}," like this:
#!/usr/bin/perl
$/ = "},";
while (<>) {
  print "$_\n";
}
or, as a one-liner:
$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat
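Run against the small data.json sample above, this splits at every "}," and produces roughly the shape you asked for; note that the wrapper text stays attached to the first and last records:
$ perl -e '$/="},";while(<>){print "$_\n"}' data.json
{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},
{"RecordId":"2","data":"data"},
{"RecordId":"3","data":"data"},
{"RecordId":"4","data":"data"},
{"RecordId":"5","data":"data"} ] }}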