Search and replace string in a very big file

Question

I prefer to get things done with shell commands. I have a very, very big file, about 2.8 GB, and its content is JSON. Everything is on one line, and I was told there are at least 1.5 million records in there.

I must prepare the file for consumption. Each record must be on its own line. Sample:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}

Or, use the following...

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email 
Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}

Final outcome should be:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Attempted commands:

sed -e 's/,{"RecordId"/}]}, {"RecordId"/g' sample.dat
awk '{gsub(",{\"RecordId\"",", {\"RecordId\"",$0); print $0}' sample.dat

The attempted commands work perfectly fine for small files, but they do not work for the 2.8 GB file that I must manipulate. sed quits midway after 10 minutes without giving a reason, and nothing is done. awk failed with a segmentation fault (core dumped) after many hours. I tried perl's search and replace and got an error saying "Out of memory".

Any help or ideas would be great!

Additional info on my machine:

More than 105 GB disk space available.
8 GB memory
4 cores CPU
Running Ubuntu 14.04

ghoti · Accepted Answer

Since you've tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's kind of off-topic, I believe that jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming content like the following (based on your sample):

{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}

You can easily reformat this to "prettify" it:

$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}

And we can dig in to the data to retrieve only the records you're interested in (regardless of what they're wrapped in):

$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}

This is much more readable, both by humans and by tools like awk which process content line by line. If you want to join the lines of each record for processing, per your question, the awk becomes much simpler:

$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{   "RecordId": "1",   "data": "data" }
{   "RecordId": "2",   "data": "data" }
{   "RecordId": "3",   "data": "data" }
{   "RecordId": "4",   "data": "data" }
{   "RecordId": "5",   "data": "data" }

Or, as @peak suggested in comments, eliminate the awk portion of this entirely by using jq's -c (compact output) option:

$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}
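
One caveat for a 2.8 GB input: without further options, jq parses the whole document into memory before it prints anything, so the commands above may still run out of RAM on an 8 GB machine. A possible workaround, assuming jq 1.5 or later (older releases do not have the --stream option), is jq's streaming parser. The sketch below wraps it in a helper called records_stream, which is a made-up name for illustration. The depth of 3 is an assumption that matches the sample, where each record sits three levels down (object, object, array):

```shell
# records_stream is a hypothetical helper name, not part of jq.
# --stream makes jq emit [path, leaf] events while it parses, instead of
# building the whole document first. truncate_stream drops the first 3 path
# components (the wrappers around each record), and fromstream rebuilds one
# record at a time from the events that remain.
records_stream() {
  jq -cn --stream 'fromstream(3|truncate_stream(inputs))' "$@"
}

# Usage: records_stream data.json > records.jsonl
```

The output should have the same shape as the -c example above, one record per line. Streaming is noticeably slower than normal parsing, so it is probably only worth it when the plain command does not fit in memory.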

neuhaus · Answer

Regarding perl: Try setting the input record separator $/ to "}," so that perl reads one chunk at a time instead of the whole line, like this:

#!/usr/bin/perl
$/ = "},";    # read up to each "}," instead of up to each newline
while (<>) {
   print "$_\n";
}

or, as a one-liner:

$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat

Tags: json, data-manipulation, large-files, awk, perl

dat789

2 Answers

ghoti

neuhaus

Recent Activity

Donate For Us

Search and replace string in a very big file

Tags:

json

data-manipulation

large-files

awk

perl

dat789

2 Answers

ghoti

neuhaus

Related questions

Recent Activity

Donate For Us