Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search and replace string in a very big file

I have a preference for shell commands to get things done. I have a very, very big file -- about 2.8 GB and the content is that of JSON. Everything is on one line, and I was told there are at least 1.5 million records in there.

I must prepare the file for consumption. Each record must be on its own line. Sample:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}

Or, use the following...

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"[email protected]","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}

Final outcome should be:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Attempted commands:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

The attempted commands works perfectly fine for small files. But it does not work for the 2.8 GB file that I must manipulate. Sed quits midway after 10 mins without reason and nothing was done. Awk errored with a Segmentation Fault (core dump) reason after many hours in. I tried perl's search and replace and got an error saying "Out of memory".

Any help/ ideas would be great!

Additional info on my machine:

  • More than 105 GB disk space available.
  • 8 GB memory
  • 4 cores CPU
  • Running Ubuntu 14.04
like image 379
dat789 Avatar asked Jan 25 '16 12:01

dat789


2 Answers

Since you've tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's kind of off-topic, I believe that jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming content like the following (based on your sample):

{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}

You can easily reformat this to "prettify" it:

$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}

And we can dig in to the data to retrieve only the records you're interested in (regardless of what they're wrapped in):

$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}

This is much more readable, both by humans and by tools like awk which process content line-by-line. If you want to join your lines for processing per your question, the awk becomes much more simple:

$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{   "RecordId": "1",   "data": "data" }
{   "RecordId": "2",   "data": "data" }
{   "RecordId": "3",   "data": "data" }
{   "RecordId": "4",   "data": "data" }
{   "RecordId": "5",   "data": "data" }

Or, as @peak suggested in comments, eliminate the awk portion of thie entirely by using jq's -c (compact output) option:

$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}
like image 197
ghoti Avatar answered Oct 10 '22 18:10

ghoti


Regarding perl: Try setting the input line separator $/ to }, like this:

#!/usr/bin/perl
$/= "},"; 
while (<>){
   print "$_\n"; 
}'

or, as a one-liner:

$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
like image 32
neuhaus Avatar answered Oct 10 '22 18:10

neuhaus