Replace each } with a }\n in a huge (12GB) file which consists of 1 line?

Question

I have an 18 GB log file from a customer, and its entire contents are on one line. I want to read the file in logstash, but I run into memory problems: the file is read line by line, and unfortunately it is all on one line.

I tried to split the file into lines so that logstash can process it (the file has a simple JSON format, no nested objects). I wanted each JSON object on its own line, splitting at } by replacing each } with }\n:

sed -i 's/}/}\n/g' NonPROD.log.backup

But sed is killed - I assume also because of memory. How can I resolve this? Can I let sed manipulate the file using other chunks of data than lines? I know by default sed reads line by line.

Charles Duffy · Accepted Answer

The following uses only functionality built into the shell:

#!/bin/bash

# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
  # ...and print that content followed by '}' and a newline.
  printf '%s}\n' "$piece"
done

# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"

If you have logstash configured to read from a TCP port (using 14321 as an arbitrary example below), you can save the code above as a script (called thescript here) and run thescript <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321" or similar, and there you are -- without needing to have double your original input file's space available on disk, as the other answers given thus far require.
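
To see what the loop does before pointing it at the real file, here is a minimal sketch that wraps it in a function and feeds it a tiny JSON-like sample. The function name split_on_brace and the sample data are made up for illustration; the loop body is the one from the script above.

```shell
#!/bin/bash
# split_on_brace: hypothetical wrapper around the loop shown above.
# Reads stdin, writes each '}'-terminated piece on its own line.
split_on_brace() {
  local piece
  # read up to the next '}' (the delimiter itself is consumed, not stored)
  while IFS= read -r -d '}' piece; do
    printf '%s}\n' "$piece"
  done
  # whatever followed the last '}' is still in $piece when read hits EOF
  [[ $piece ]] && printf '%s\n' "$piece"
  return 0
}

printf '{"a":1}{"b":2}{"c":3}' | split_on_brace
# {"a":1}
# {"b":2}
# {"c":3}
```

Because read only ever holds one piece at a time, memory use depends on the longest single object, not on the size of the file.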

Ed Morton · Answer

With GNU awk for RT:

$ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
abc}
def}
ghi

with other awks:

$ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
abc}
def}
ghi

I decided to test all of the currently posted solutions for functionality and execution time using an input file generated by this command:

awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m
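
As a sanity check on that generator, the same awk program can be run with a small count and inspected. This is only a sketch: make_input is a hypothetical helper name and the count of 3 is arbitrary.

```shell
# make_input N: hypothetical helper, same awk program as above with the
# repeat count passed in. Emits N copies of "foo}", then "foo" and a newline.
make_input() {
  awk -v n="$1" 'BEGIN{for(i=1;i<=n;i++)printf "foo}"; print "foo"}'
}

make_input 3                       # prints: foo}foo}foo}foo
make_input 3 | wc -c               # 16 bytes: 3*4 for "foo}" plus 4 for "foo\n"
make_input 3 | tr -cd '}' | wc -c  # 3 closing braces
make_input 3 | wc -l               # 1 line
```

By the same arithmetic, file1m should be 4,000,004 bytes on a single line containing 1,000,000 closing braces.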

and here's what I got:

1) awk (both awk scripts above had similar results):

time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m

Got expected output, timing =

real    0m0.608s
user    0m0.561s
sys     0m0.045s

2) shell loop:

$ cat tst.sh
#!/bin/bash

# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
  # ...and print that content followed by '}' and a newline.
  printf '%s}\n' "$piece"
done

# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"

$ time ./tst.sh < file1m

Got expected output, timing =

real    1m52.152s
user    1m18.233s
sys     0m32.604s

3) tr+sed:

$ time tr '}' '\n' < file1m | sed 's/$/}/'

Did not produce the expected output (Added an undesirable } at the end of the file), timing =

real    0m0.577s
user    0m0.468s
sys     0m0.078s

With a tweak to remove that final undesirable }:

$ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'

real    0m0.718s
user    0m0.670s
sys     0m0.108s
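
The difference between the two tr+sed variants is easier to see on a three-record sample. A small sketch follows; the function names are hypothetical wrappers around the two pipelines above.

```shell
# Hypothetical wrappers around the two pipelines above; both read stdin.
tr_sed()       { tr '}' '\n' | sed 's/$/}/'; }
tr_sed_fixed() { tr '}' '\n' | sed 's/$/}/; $s/}//'; }

printf 'abc}def}ghi\n' | tr_sed
# abc}
# def}
# ghi}      <- the last line gets a } that was never in the input

printf 'abc}def}ghi\n' | tr_sed_fixed
# abc}
# def}
# ghi
```

Note that the tweak removes a } from the last line unconditionally. It is right for input shaped like file1m, which ends in foo plus a newline, but as far as I can tell it would also strip a genuine closing } if the input ended in } with no trailing newline.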

4) fold+sed+tr:

$ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n'

Got expected output, timing =

real    0m0.811s
user    0m1.137s
sys     0m0.076s
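
One thing worth checking with this pipeline is whether the fold width lines up with the record length. In file1m every record is 4 characters, so a width of 1000 always breaks right after a }. The sketch below uses a tiny width to show both cases; fold_split is a hypothetical helper name, and GNU sed is assumed for \n in the replacement.

```shell
# fold_split WIDTH: hypothetical helper, the pipeline above with the
# fold width as a parameter. Reads stdin. Assumes GNU sed.
fold_split() {
  fold -w "$1" | sed 's/}/}\n\n/g' | tr -s '\n'
}

printf 'abc}def}ghi\n' | fold_split 4   # width is a multiple of the record length
# abc}
# def}
# ghi

printf 'abc}def}ghi\n' | fold_split 5   # width does not line up with the records
# abc}
# d
# ef}
# gh
# i
```

In the second case the newlines that fold inserts in the middle of records survive, because tr -s only squeezes runs of newlines and does not remove single ones. So the "expected output" result above may depend on the shape of the test file.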

5) split+sed+cat:

$ cat tst2.sh
mkdir tmp$$
pwd="$(pwd)"
cd "tmp$$"
split -b 1m "${pwd}/${1}"
sed -i 's/}/}\n/g' x*
cat x*
rm -f x*
cd "$pwd"
rmdir tmp$$

$ time ./tst2.sh file1m

Got expected output, timing =

real    0m0.983s
user    0m0.685s
sys     0m0.167s
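
The split approach cuts the file at arbitrary byte offsets, which is safe here because the pattern is a single character and so can never straddle a chunk boundary. A minimal sketch with a deliberately tiny chunk size shows this; split_sed_cat is a hypothetical helper, and GNU sed is assumed for -i without a suffix and for \n in the replacement.

```shell
# split_sed_cat BYTES: hypothetical helper that mirrors tst2.sh but reads
# stdin and works in a throwaway directory. Assumes GNU sed.
split_sed_cat() {
  local dir
  dir=$(mktemp -d) || return 1
  # split writes chunks named xaa, xab, ...; the glob x* sorts them in order
  ( cd "$dir" && split -b "$1" - && sed -i 's/}/}\n/g' x* && cat x* )
  rm -rf "$dir"
}

printf 'abc}def}ghi\n' | split_sed_cat 5
# abc}
# def}
# ghi
```

Here the chunks are abc}d, ef}gh and i plus the newline, so records are cut in the middle, yet concatenating the edited chunks still gives the right result.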
