
Replace each } with a }\n in a huge (12GB) file which consists of 1 line?

I have an 18 GB log file from a customer, and its entire contents are on a single line. I want to read the file into Logstash, but I run into memory problems: Logstash reads its input line by line, and here the whole file is one line.

I tried to split the file into lines so that Logstash can process it. The file has a simple JSON format with no nested objects, so I wanted each JSON object on its own line, splitting at } by replacing it with }\n:

sed -i 's/}/}\n/g' NonPROD.log.backup

But sed gets killed, I assume also because of memory. How can I resolve this? Can I make sed process the file in chunks other than lines? I know that by default sed reads line by line.

user74211 asked Jun 30 '17 17:06


2 Answers

The following uses only functionality built into the shell:

#!/bin/bash

# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
  # ...and print that content followed by '}' and a newline.
  printf '%s}\n' "$piece"
done

# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"

If you have Logstash configured to read from a TCP port (14321 is an arbitrary example below), you can save the above as a script and run thescript <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321" or similar. That streams the result straight to Logstash without needing double your original input file's space on disk, which the other answers given so far require.

Charles Duffy answered Oct 13 '22 00:10


With GNU awk for RT:

$ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
abc}
def}
ghi

With other awks:

$ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
abc}
def}
ghi
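
To make the one-liner for other awks easier to follow, here is a spelled-out sketch of the same idea. Without RT there is no way to tell whether a record was actually terminated by }, so printing is delayed by one record and the last record is emitted without an added }. The function name split_on_brace is a hypothetical wrapper, not part of the original command:

```shell
# split_on_brace: hypothetical helper; reads stdin or the files given as arguments.
split_on_brace() {
  awk -v RS='}' -v ORS='}\n' '
    NR > 1 { print prev }       # emit the previous record followed by ORS
    { prev = $0 }               # hold the current record back one step
    END { printf "%s", prev }   # last record: print it without adding ORS
  ' "$@"
}

printf 'abc}def}ghi\n' | split_on_brace
```

On the sample input this prints the same three lines as the commands above (abc}, def}, ghi).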

I decided to test all of the currently posted solutions for functionality and execution time using an input file generated by this command:

awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m

and here's what I got:

1) awk (both awk scripts above had similar results):

time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m

Got expected output, timing =

real    0m0.608s
user    0m0.561s
sys     0m0.045s

2) shell loop:

$ cat tst.sh
#!/bin/bash

# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
  # ...and print that content followed by '}' and a newline.
  printf '%s}\n' "$piece"
done

# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"

$ time ./tst.sh < file1m

Got expected output, timing =

real    1m52.152s
user    1m18.233s
sys     0m32.604s

3) tr+sed:

$ time tr '}' '\n' < file1m | sed 's/$/}/'

Did not produce the expected output (it added an undesirable } at the end of the file), timing =

real    0m0.577s
user    0m0.468s
sys     0m0.078s

With a tweak to remove that final undesirable }:

$ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'

real    0m0.718s
user    0m0.670s
sys     0m0.108s

4) fold+sed+tr:

$ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n'

Got expected output, timing =

real    0m0.811s
user    0m1.137s
sys     0m0.076s

5) split+sed+cat:

$ cat tst2.sh
mkdir "tmp$$" || exit
pwd="$(pwd)"
cd "tmp$$" || exit   # bail out rather than run sed/rm on x* in the wrong directory
split -b 1m "${pwd}/${1}"
sed -i 's/}/}\n/g' x*
cat x*
rm -f x*
cd "$pwd" || exit
rmdir "tmp$$"

$ time ./tst2.sh file1m

Got expected output, timing =

real    0m0.983s
user    0m0.685s
sys     0m0.167s
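
A caveat on the fold+sed+tr result in 4): fold breaks lines at a fixed width regardless of content, so the clean output there relies on the 4-byte foo} records dividing evenly into 1000. The sketch below uses a width that does not line up with the records. The helper name fold_split is hypothetical, and the \n in the sed replacement assumes GNU sed, as in the commands above:

```shell
# fold_split WIDTH: hypothetical helper running the fold+sed+tr pipeline on stdin.
fold_split() {
  fold -w "$1" | sed 's/}/}\n\n/g' | tr -s '\n'
}

# Width 4 does not line up with the 7-byte record "abcdef}".
printf 'abcdef}gh}\n' | fold_split 4
```

Here abcdef} comes out as abcd and ef} on separate lines (followed by g and h}), so on real logs any record that straddles a fold boundary would be cut in two.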

Ed Morton answered Oct 13 '22 01:10