Is there any write buffer in bash programming?

Is there any write-to-file buffer in bash programming? And if there is, is it possible to change its size?

Here is the problem:

I have a bash script which reads a file line by line, manipulates the data it reads, and then writes the result to another file. Something like this:

while read line ; do
  # some grep, cut and sed on the line
  # echo the result and append it to another file
done < input_file

The input data is really huge (nearly a 20 GB text file). Progress is slow, so the question arises: if bash's default behavior is to write the result to the output file for each line it reads, then the whole process is bound to be slow.

So I want to know: is there any mechanism to buffer some output and then write that chunk to the file? I searched the internet about this issue but didn't find any useful information...

Is this an OS-related question or a bash one? The OS is CentOS release 6.

The script is:

#!/bin/bash
BENCH=$1
# keep only the "CPU  0" lines of the benchmark trace
grep "CPU  0" $BENCH > `pwd`/$BENCH.cpu0
# pull out the hex values after "<v:0x"/"<p:0x", join them in pairs and upper-case them (bc needs uppercase hex)
grep -oP '(?<=<[vp]:0x)[0-9a-z]+' `pwd`/$BENCH.cpu0 | sed 'N;s/\n/ /' | tr '[:lower:]' '[:upper:]' > `pwd`/$BENCH.cpu0.data.VP
echo "grep done"
# convert each pair of hex values to decimal, one line at a time
while read line ; do
   w1=`echo $line | cut -d ' ' -f1`
   w11=`echo "ibase=16; $w1" | bc`
   w2=`echo $line | cut -d ' ' -f2`
   w22=`echo "ibase=16; $w2" | bc`
   echo $w11 $w22 >> `pwd`/$BENCH.cpu0.data.VP.decimal
done <"`pwd`/$BENCH.cpu0.data.VP"
echo "conversion done"
asked May 29 '13 by mahmood

2 Answers

Each echo-and-append in your loop opens and closes the output file, which may have a negative impact on performance.
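If you do keep a loop, one way to avoid the repeated open/close is to redirect the append once for the whole loop rather than on every echo. A minimal sketch (the per-line processing is just a placeholder):

while read -r line; do
    # ... per-line grep/cut/sed on "$line" goes here ...
    echo "$line"              # written through the single redirection below
done < "$input_file" >> "$output_file"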

A likely better approach (and you should profile) is simply:

grep 'foo' "$input_file" | sed 's/bar/baz/' | [any other stream operations] > "$output_file"

If you must keep the existing structure, then an alternative approach would be to create a named pipe:

mkfifo buffer

Then create two processes: one which writes into the pipe, and one which reads from the pipe.

#proc1: filter each line of the input and write the result into the pipe
while read -r line; do
    echo "$line" | grep foo | sed 's/bar/baz/'
done < "$input_file" > buffer


#proc2: read from the pipe and append to the output file
while read -r line; do
    echo "$line" >> "$output_file"
done < buffer
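To actually overlap the two loops, they have to run at the same time. A rough sketch of wiring them together (proc1.sh and proc2.sh are just placeholder names for the two loops above):

mkfifo buffer
./proc2.sh &          # reader: drains the pipe into $output_file
./proc1.sh            # writer: filters $input_file into the pipe
wait                  # wait for the background reader to see EOF and finish
rm buffer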

In reality I would expect the bottleneck to be entirely file IO, but this does create independence between the reading and the writing, which may be desirable.

If you have 20GB of RAM lying around, it may improve performance to use a memory-mapped temporary file instead of a named pipe.
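As a rough illustration of that idea (not necessarily the exact setup the answer has in mind): on most Linux systems /dev/shm is a tmpfs mount, so a temporary file created there lives in RAM:

tmpfile=$(mktemp /dev/shm/buffer.XXXXXX)                   # RAM-backed temporary file (tmpfs)
grep 'foo' "$input_file" | sed 's/bar/baz/' > "$tmpfile"   # stage the filtered data in RAM
cat "$tmpfile" >> "$output_file"                           # then append it to the output in one pass
rm -f "$tmpfile"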

answered Oct 13 '22 by cmh


Just to see what the differences were, I created a file containing 10,000 lines (about 50 MiB), each something like

a somewhat long string followed by a number: 0000001

and then ran it through a shell read loop:

while read line ; do
  echo $line | grep '00$' | cut -d " " -f9 | sed 's/^00*//'
done < data > data.out

which took almost 6 minutes, compared with the equivalent

grep '00$' data | cut -d " " -f9 | sed 's/^00*//' > data.fast

which took 0.2 seconds. To remove the cost of the forking, I tested

while read line ; do
  :
done < data > data.null

where : is a shell built-in which does nothing at all. As expected, data.null had no contents, and the loop still took 21 seconds to run through my small file. I wanted to test against a 20 GB input file, but I'm not that patient.
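For reference, a sketch of how such a comparison could be reproduced; the generator line and the script name are only illustrative, and this assumes GNU seq and bash's time keyword:

# generate a test file of numbered lines (much shorter lines than in the test above)
seq -f "a somewhat long string followed by a number: %07g" 1 10000 > data

# time the while-read version above (saved as loop.sh, a hypothetical name) against the single pipeline
time bash loop.sh
time grep '00$' data | cut -d " " -f9 | sed 's/^00*//' > data.fast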

Conclusion: learn how to use awk or perl, because you will wait forever if you try to use the script you posted while I was writing this answer.
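For instance, the per-line cut and bc calls in the posted script could be replaced by a single awk pass over the intermediate file. A sketch, assuming GNU awk (for strtonum) and the same file names as in the question:

# convert both hex columns to decimal in one pass instead of forking bc per line
awk '{ printf "%d %d\n", strtonum("0x" $1), strtonum("0x" $2) }' \
    "$BENCH.cpu0.data.VP" > "$BENCH.cpu0.data.VP.decimal"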

answered Oct 13 '22 by msw