
bash 'while read line' efficiency with big file

I was using a while loop to process a task that reads records from a big file of about 10 million lines.

I found that the processing becomes slower and slower as time goes by.

I made a simulated script with 1 million lines, shown below, which reveals the problem.

But I still don't know why. How does the read command work?

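seq 1000000 > seq.dat
A=`date +%s`    # initialize A so the first interval is well-defined
while read s;
do
    # expr and date are external commands, run in a new process each time
    if [ `expr $s % 50000` -eq 0 ];then
        echo -n $( expr `date +%s` - $A) ' ';
        A=`date +%s`;
    fi
done < seq.dat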

The terminal prints the time intervals, in seconds:

98 98 98 98 98 97 98 97 98 101 106 112 121 121 127 132 135 134

From about the tenth interval onward (roughly 500,000 lines), the processing becomes noticeably slower.

leemzoon asked Apr 28 '12 14:04


1 Answer

Using your code, I saw the same pattern of increasing times (right from the beginning!). If you want faster processing, you should rewrite the loop using the shell's built-in features. Here's my bash version:

tabChar=$'\t'  # a literal tab character
seq 1000000 > seq.dat
while read s;
do
    # built-in arithmetic: true only when s is a multiple of 50000
    if (( ! ( s % 50000 ) )) ;then
        echo $s "${tabChar}" $( expr `date +%s` - $A)
        A=$(date +%s);
    fi
done < seq.dat

Edit: fixed a bug. The output indicated that every line was being processed; now only every 50000th line gets the timing treatment. D'oh!

was

  if ((  s % 50000 )) ;then

fixed to

  if (( ! ( s % 50000 ) )) ;then
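
The reason for the fix is that (( expr )) succeeds when expr is non-zero, so a bare modulo test is true for every line that is not a multiple of 50000. Here is a minimal sketch of that behaviour; the function names are made up for illustration, and the third variant is just an alternative spelling that I find easier to read:

```bash
#!/usr/bin/env bash
# (( expr )) returns success (exit status 0) when expr evaluates to NON-zero.

# Hypothetical helper names, for illustration only.
is_multiple_buggy() { (( $1 % 50000 )); }         # true when the remainder is NOT 0
is_multiple_fixed() { (( ! ( $1 % 50000 ) )); }   # true when the remainder is 0
is_multiple_alt()   { (( $1 % 50000 == 0 )); }    # same result as the fixed form

for n in 49999 50000 100000; do
    for f in is_multiple_buggy is_multiple_fixed is_multiple_alt; do
        if "$f" "$n"; then r=yes; else r=no; fi
        printf '%s(%s) -> %s\n' "$f" "$n" "$r"
    done
done
```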

Output now, under ksh (echo ${.sh.version} = Version JM 93t+ 2010-05-24):

50000
100000   1
150000   0
200000   1
250000   0
300000   1
350000   0
400000   1
450000   0
500000   1
550000   0
600000   1
650000   0
700000   1
750000   0
800000   1
850000   0
900000   1
950000   0
1e+06    1

Output under bash:

50000    480
100000   3
150000   2
200000   3
250000   3
300000   2
350000   3
400000   3
450000   2
500000   2
550000   3
600000   2
650000   2
700000   3
750000   3
800000   2
850000   2
900000   3
950000   2
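
If you want to go one step further with built-ins, the remaining calls to date and expr can probably be dropped too. This is only a sketch, assuming bash and its SECONDS variable (whole seconds since the shell started); the function name report_intervals and the smaller 200000-line input are my own choices for illustration, not part of the runs above:

```bash
#!/usr/bin/env bash
# report_intervals FILE STEP (hypothetical helper):
# every STEP-th value read from FILE, print the value, a tab, and the
# seconds elapsed since the previous report, using only shell built-ins.
report_intervals() {
    local file=$1 step=$2 s last=$SECONDS
    while IFS= read -r s; do
        if (( s % step == 0 )); then
            printf '%s\t%d\n' "$s" "$(( SECONDS - last ))"
            last=$SECONDS
        fi
    done < "$file"
}

tmp=$(mktemp)
seq 1 200000 > "$tmp"          # smaller than the original so it finishes quickly
report_intervals "$tmp" 50000
rm -f "$tmp"
```

Because last is initialised before the loop, the first report already has a defined interval.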

As to why your original test case takes so long, I'm not sure. I was surprised to see both the time for each test cycle and the increase in time. If you really need to understand this, you may need to spend time instrumenting more test cases. You might see something by running the script under truss or strace (depending on your base OS).
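
One simple piece of instrumentation is to time the two styles of test side by side on a small input. The sketch below assumes bash; the function names and the 2000-line input are invented for illustration. It only compares the per-line cost of an external expr call against built-in arithmetic, and does not by itself explain why the intervals grow over time:

```bash
#!/usr/bin/env bash
# Variant A: one external `expr` process per line, as in the question.
count_with_expr() {
    local s n=0
    while read -r s; do
        if [ "$(expr "$s" % "$2")" -eq 0 ]; then n=$(( n + 1 )); fi
    done < "$1"
    echo "$n"
}

# Variant B: the same test with built-in arithmetic, no extra processes.
count_with_builtin() {
    local s n=0
    while read -r s; do
        if (( s % $2 == 0 )); then n=$(( n + 1 )); fi
    done < "$1"
    echo "$n"
}

tmp=$(mktemp)
seq 1 2000 > "$tmp"            # kept small: variant A starts 2000 processes
time count_with_expr "$tmp" 500
time count_with_builtin "$tmp" 500
rm -f "$tmp"
```

Both functions should print the same count; only the timings reported by time differ. On Linux, running the original script under strace -c -f should also give a per-system-call summary that includes the child processes.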

I hope this helps.

shellter answered Sep 20 '22 11:09