
bash: process list of files in chunks

The setting:

I have a few hundred files, named something like input0.dat, input1.dat, ..., input150.dat, which I need to process using some command cmd (which basically merges the contents of all files). cmd takes the output filename as its first argument, followed by the list of all input filenames:

./cmd output.dat input0.dat input1.dat [...] input150.dat

The problem:

The problem is that cmd can only handle about 10 files at a time due to memory issues (don't blame me for that). Thus, instead of using bash wildcard expansion like

./cmd output.dat *dat

I need to do something like

./cmd temp_output0.dat file0.dat file1.dat [...] file9.dat
[...]
./cmd temp_outputN.dat fileN0.dat fileN1.dat [...] fileN9.dat

Afterwards I can merge the temporary outputs:

./cmd output.dat temp_output0.dat [...] temp_outputN.dat

How do I script this efficiently in bash?

I tried, e.g., the following, without success:

for filename in `echo *dat | xargs -n 3`; do [...]; done

The problem is that the grouping is lost: the output lines of xargs are concatenated by the command substitution and re-split on whitespace, so the loop receives single filenames instead of groups of three.
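
For illustration, assuming just four files a.dat through d.dat, the loop body runs once per filename rather than once per group of three:

# with a.dat b.dat c.dat d.dat present, this prints four lines, one
# filename each -- the three-per-line grouping from xargs is lost:
for group in `echo *dat | xargs -n 3`; do echo "got: $group"; done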

EDIT: Note that I need to specify an output filename as the first command-line argument when calling cmd!

asked Jan 20 '12 by fuenfundachtzig


2 Answers

edit: Without a pipe or process substitution (requires bash). This version can deal with files that have spaces in their names: use a bash array and extract slices from it:

i=0
infiles=(*dat)                          # all input files, in an array
opfiles=()
while ((${#infiles[@]})); do
    threefiles=("${infiles[@]:0:3}")    # take the first (up to) three files
    ./cmd tmp_output$i.dat "${threefiles[@]}"
    opfiles+=("tmp_output$i.dat")
    ((i++))
    infiles=("${infiles[@]:3}")         # drop them from the front of the array
done
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"

Using a fifo (note that this version cannot deal with spaces in filenames):

i=0
opfiles=
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&       # background writer: three names per line
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles="$opfiles tmp_output$i.dat"
    ((i++))
done </tmp/foo                          # read from the fifo, not from a pipe
rm -f /tmp/foo
wait
./cmd output.dat $opfiles
rm $opfiles

You need the fifo so that the while loop runs in the current shell: it keeps the value of the i variable, as well as the list of files for the final concatenation, after the loop ends. Piping into the while would run the loop in a subshell instead.
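
A minimal sketch of the pitfall: piping into while runs the loop in a subshell, so any variables set inside are lost when the loop ends:

i=0
echo *dat | xargs -n 3 | while read threefiles; do
    ((i++))            # increments a copy inside the subshell
done
echo "$i"              # still prints 0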

If you want, you can background the inner invocations of ./cmd; the wait before the last invocation of cmd then makes sure they have all finished:

i=0
opfiles=
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles&  # run each chunk in the background
    opfiles="$opfiles tmp_output$i.dat"
    ((i++))
done </tmp/foo
rm -f /tmp/foo
wait                                     # let all backgrounded chunks finish
./cmd output.dat $opfiles
rm $opfiles
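
If running every chunk at once recreates the memory pressure, the number of simultaneous jobs can be capped. A sketch, assuming bash 4.3+ for wait -n; note that the backgrounded xargs writer itself occupies one job slot, hence the strict comparison:

i=0
opfiles=
max_jobs=4                  # hypothetical cap, tune to your memory budget
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&
while read threefiles; do
    # block until a job slot frees up; the xargs writer counts as one job
    while (( $(jobs -rp | wc -l) > max_jobs )); do
        wait -n
    done
    ./cmd tmp_output$i.dat $threefiles&
    opfiles="$opfiles tmp_output$i.dat"
    ((i++))
done </tmp/foo
rm -f /tmp/foo
wait
./cmd output.dat $opfiles
rm $opfiles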

update: If you want to avoid using a fifo entirely, you can use process substitution to emulate it, rewriting the first fifo version as:

i=0
opfiles=()
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles+=("tmp_output$i.dat")
    ((i++)) 
done < <(echo *dat | xargs -n 3)
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"

Again, this avoids piping into the while loop: reading from a redirection keeps the opfiles variable available after the loop.
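
For completeness, a sketch of how the same loop can be made safe for filenames with spaces: feed null-delimited names into the loop and batch them in an array:

i=0
opfiles=()
batch=()
while IFS= read -r -d '' f; do
    batch+=("$f")
    if (( ${#batch[@]} == 3 )); then
        ./cmd "tmp_output$i.dat" "${batch[@]}"
        opfiles+=("tmp_output$i.dat")
        ((i++))
        batch=()
    fi
done < <(printf '%s\0' *dat)
if (( ${#batch[@]} )); then         # leftover group of fewer than three files
    ./cmd "tmp_output$i.dat" "${batch[@]}"
    opfiles+=("tmp_output$i.dat")
fi
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"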

answered Sep 22 '22 by Petesh



Try the following; it should work for you:

echo *dat | xargs -n3 ./cmd output.dat

EDIT: In response to your comment:

for i in {0..9}; do
    echo file${i}*.dat | xargs -n3 ./cmd output${i}.dat
done

That sends no more than three files at a time to ./cmd, while going over all files from file00.dat to file99.dat, producing ten different output files, output0.dat to output9.dat.
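
Note that this leaves the final merge to you. Assuming the ten temporary outputs fit within cmd's limit, one last call, sketched here, completes the job:

./cmd output.dat output{0..9}.dat   # merge the per-group outputs
rm output{0..9}.dat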

answered Sep 20 '22 by drrlvn