Is there any write-to-file buffer in bash? And if there is, is it possible to change its size?
Here is the problem.
I have a bash script which reads a file line by line, manipulates the data it reads, and then writes the result into another file. Something like this:
while read line
    some grep, cut and sed
    echo and append to another file
The input data is really huge (nearly a 20GB text file). Progress is slow, so the question arises: if the default behavior of bash is to write the result to the output file for every line read, then the process is bound to be slow.
So I want to know: is there any mechanism to buffer some output and then write that chunk to the file? I searched the internet for this but didn't find any useful information...
Is this an OS-related question or a bash one? The OS is CentOS release 6.
The script is:
#!/bin/bash
BENCH=$1

# keep only the lines for CPU 0
grep "CPU 0" $BENCH > `pwd`/$BENCH.cpu0

# pull out the hex values after "<v:0x"/"<p:0x", join them in pairs, uppercase them
grep -oP '(?<=<[vp]:0x)[0-9a-z]+' `pwd`/$BENCH.cpu0 | sed 'N;s/\n/ /' | tr '[:lower:]' '[:upper:]' > `pwd`/$BENCH.cpu0.data.VP
echo "grep done"

# convert both hex columns to decimal, one line at a time
while read line ; do
    w1=`echo $line | cut -d ' ' -f1`
    w11=`echo "ibase=16; $w1" | bc`
    w2=`echo $line | cut -d ' ' -f2`
    w22=`echo "ibase=16; $w2" | bc`
    echo $w11 $w22 >> `pwd`/$BENCH.cpu0.data.VP.decimal
done <"`pwd`/$BENCH.cpu0.data.VP"
echo "conversion done"
Each echo-and-append in your loop opens and closes the output file, which may have a negative impact on performance.
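A minimal sketch of how to avoid that, based on the loop in the question: write to stdout inside the loop and redirect the whole loop once at the done, so the output file is opened a single time instead of once per line.
# output file is opened once for the whole loop, not once per line
while read -r line ; do
    w1=$(echo "$line" | cut -d ' ' -f1)
    w11=$(echo "ibase=16; $w1" | bc)
    w2=$(echo "$line" | cut -d ' ' -f2)
    w22=$(echo "ibase=16; $w2" | bc)
    echo "$w11 $w22"              # goes to stdout, redirected below
done < "$BENCH.cpu0.data.VP" > "$BENCH.cpu0.data.VP.decimal"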
A likely better approach (and you should profile) is simply:
grep 'foo' "$input_file" | sed 's/bar/baz/' | [any other stream operations] > "$output_file"
If you must keep the existing structure then an alternative approach would be to create a named pipe:
mkfifo buffer
Then create 2 processes: one which writes into the pipe, and one which reads from the pipe.
#proc1: filter each line and write the result into the pipe
while read -r line; do
    echo "$line" | grep foo | sed 's/bar/baz/'
done < "$input_file" > buffer
#proc2: drain the pipe and append to the output file
while read -r line; do
    echo "$line" >> "$output_file"
done < buffer
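To run the two halves concurrently, start the writer in the background and the reader in the foreground; both ends of the FIFO must be open before any data flows, otherwise whichever side starts first simply blocks. A minimal end-to-end sketch (the file names are placeholders, not from the question):
#!/bin/bash
input_file=trace.log      # placeholder input
output_file=trace.out     # placeholder output

mkfifo buffer

# proc1 in the background: filter into the pipe
grep foo "$input_file" | sed 's/bar/baz/' > buffer &

# proc2 in the foreground: drain the pipe into the output file
while read -r line; do
    echo "$line"
done < buffer > "$output_file"

wait        # make sure the background writer has finished
rm buffer   # remove the FIFO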
In reality I would expect the bottleneck to be entirely file IO, but this does create independence between the reading and the writing, which may be desirable.
If you have 20GB of RAM lying around, it may improve performance to use a memory-mapped temporary file instead of a named pipe.
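Bash has no mmap primitive of its own, but on CentOS 6 (and most Linux systems) /dev/shm is a RAM-backed tmpfs, so staging the intermediate file there gets much of the same effect. A sketch, reusing the placeholder names from the FIFO example above:
# put the intermediate file on RAM-backed tmpfs instead of disk
tmpfile=$(mktemp /dev/shm/stage.XXXXXX)

grep foo "$input_file" | sed 's/bar/baz/' > "$tmpfile"

# any further per-line processing now reads from RAM, not from disk
while read -r line; do
    echo "$line"
done < "$tmpfile" > "$output_file"

rm "$tmpfile"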
Just to see what the differences were, I created a file containing 10,000 lines (about 50MiB), each a somewhat long string followed by a number such as 0000001, and then ran it through a shell read loop:
while read line ; do
echo $line | grep '00$' | cut -d " " -f9 | sed 's/^00*//'
done < data > data.out
which took almost 6 minutes, compared with the equivalent
grep '00$' data | cut -d " " -f9 | sed 's/^00*//' > data.fast
which took 0.2 seconds. To remove the cost of the forking, I tested
while read line ; do
:
done < data > data.null
where : is a shell built-in which does nothing at all. As expected, data.null had no contents, and the loop still took 21 seconds to run through my small file. I wanted to test against a 20GB input file, but I'm not that patient.
Conclusion: learn how to use awk or perl, because you will wait forever if you try to use the script you posted while I was writing this.
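As an illustration of that point (a sketch, not code from the answers above): with GNU awk, the hex-to-decimal loop from the question collapses into a single pass over the file, assuming gawk's strtonum is available and the values fit in a double; for larger values you would still want bc or similar.
# one pass over the file: convert both uppercase-hex columns to decimal (gawk)
awk '{ printf "%d %d\n", strtonum("0x" $1), strtonum("0x" $2) }' \
    "$BENCH.cpu0.data.VP" > "$BENCH.cpu0.data.VP.decimal"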