I have a log file (from a customer), 18 GB in size, and all of its contents are on a single line. I want to read the file into logstash, but I run into memory problems: the file is read line by line, and unfortunately it is all on one line.
I tried to split the file into lines so that logstash can process it (the file has a simple JSON format with no nested objects). I wanted each JSON object on its own line, splitting at } by replacing it with }\n:
sed -i 's/}/}\n/g' NonPROD.log.backup
But sed gets killed - I assume also because of memory. How can I resolve this? Can I make sed process the file in chunks other than lines? I know that by default sed reads line by line.
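To illustrate on a tiny sample, the transformation I'm after looks like this (illustrative input only, not the real data):
$ printf '{"a":1}{"b":2}{"c":3}' | sed 's/}/}\n/g'
{"a":1}
{"b":2}
{"c":3}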
The following uses only functionality built into the shell:
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
# ...and print that content followed by '}' and a newline.
printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
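You can save this as, say, thescript and smoke-test it on a tiny sample before pointing it at the real file:
$ printf '{"a":1}{"b":2}' | bash thescript
{"a":1}
{"b":2}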
If you have logstash configured to read from a TCP port (using 14321 as an arbitrary example below), you can run the script as
thescript <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321"
or similar, and there you are -- without needing double your original input file's space available on disk, as the other answers given so far require.
With GNU awk for RT:
$ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
abc}
def}
ghi
With other awks:
$ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
abc}
def}
ghi
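Both variants read the input one }-delimited record at a time, so memory use is bounded by the size of the largest record rather than the file size. Applied to the file from the question it would be something like this (GNU awk version shown; the output filename is just an example):
$ awk -v RS='}' '{ORS=(RT?"}\n":"")}1' NonPROD.log.backup > NonPROD.log.lines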
I decided to test all of the currently posted solutions for functionality and execution time using an input file generated by this command:
awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m
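That produces a single line of one million foo} fields plus a trailing foo, roughly 4 MB in total, which you can sanity-check with:
$ wc -c file1m      # should report 4000004 bytes
$ head -c 20 file1m # should show foo}foo}foo}foo}foo}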
and here's what I got:
1) awk (both awk scripts above had similar results):
time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m
Got expected output, timing =
real 0m0.608s
user 0m0.561s
sys 0m0.045s
2) shell loop:
$ cat tst.sh
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
# ...and print that content followed by '}' and a newline.
printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
$ time ./tst.sh < file1m
Got expected output, timing =
real 1m52.152s
user 1m18.233s
sys 0m32.604s
3) tr+sed:
$ time tr '}' '\n' < file1m | sed 's/$/}/'
Did not produce the expected output (added an undesirable } at the end of the file), timing =
real 0m0.577s
user 0m0.468s
sys 0m0.078s
With a tweak to remove that final undesirable }:
$ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'
real 0m0.718s
user 0m0.670s
sys 0m0.108s
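Note that tr works here where sed -i on the original file did not, because tr is a pure byte-stream filter and never buffers a whole line, and the sed after it only ever sees short lines. A quick way to preview the result on the real file (filename from the question) without writing anything to disk:
$ tr '}' '\n' < NonPROD.log.backup | sed 's/$/}/' | head -5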
4) fold+sed+tr (fold breaks the single long line into 1000-character chunks so the downstream sed never has to hold the whole line in memory):
$ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n'
Got expected output, timing =
real 0m0.811s
user 0m1.137s
sys 0m0.076s
5) split+sed+cat:
$ cat tst2.sh
#!/bin/bash
# split the input into 1MB chunks in a temp dir, insert a newline after
# every } in each chunk, then concatenate the chunks and clean up
mkdir tmp$$
pwd="$(pwd)"
cd "tmp$$"
split -b 1m "${pwd}/${1}"
sed -i 's/}/}\n/g' x*
cat x*
rm -f x*
cd "$pwd"
rmdir tmp$$
$ time ./tst2.sh file1m
Got expected output, timing =
real 0m0.983s
user 0m0.685s
sys 0m0.167s
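Going by those timings, one way to tie this back to the original logstash question is to feed the fastest splitter (awk) straight into the TCP input described earlier, so the 18 GB file never has to be rewritten on disk (host/port are the same arbitrary example as above; the /dev/tcp redirection requires bash):
$ awk -v RS='}' '{ORS=(RT?"}\n":"")}1' NonPROD.log.backup > /dev/tcp/127.0.0.1/14321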