 

About awk shell and pipe in linux

Tags:

linux

shell

awk

Hi everyone, I am working with a log file of about 5 million lines, so I am using awk on Linux.

I need to extract the domains and find the 100 most frequent ones in the log, so I wrote this:

          awk '{print $19}' $1 | 
          awk '{ split($0, string, "/");print string[1]}' |
          awk '{domains[$0]++} END{for(j in domains) print domains[j], j}' |
          sort -n | tail -n 100 > $2

It runs in about 13 seconds.

Then I changed the script to this:

          awk '{split($19, string, "/"); domains[string[1]]++}
               END{for(j in domains) print domains[j], j}' $1 |
          sort -n | tail -n 100 > $2

It runs in about 21 seconds.

Why?

I expected the single awk command to do less total work, since it reads each line only once, yet the run time went up.

If you know the answer, please tell me.

Flypig asked Dec 11 '25 19:12



1 Answer

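When you pipe commands together, they run concurrently as separate processes: each stage keeps working as long as it has input to read and room in the pipe to write to.

So my guess is that in the first version the work is distributed across your CPU cores, while in the second one all the awk work is done by a single core.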

You can verify this with top (or, better, htop).
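Another way to check, without watching top, is to compare wall-clock time with CPU time. Below is a minimal sketch, assuming bash (for the `time` keyword) and a synthetic log whose 19th field looks like `domain/path`. The generated file and the function names `piped` and `single` are made up for illustration and are not the asker's data:

```shell
#!/usr/bin/env bash
# Hypothetical sample log: 18 filler fields, then "domain/path" as field 19.
log=$(mktemp)
trap 'rm -f "$log"' EXIT

awk 'BEGIN {
    prefix = ""
    for (f = 1; f <= 18; f++) prefix = prefix "x "
    for (i = 0; i < 200000; i++)
        print prefix "host" (i % 500) ".example.com/page" i
}' > "$log"

# Variant 1: three awk processes connected by pipes.
piped() {
    awk '{print $19}' "$1" |
    awk '{split($0, s, "/"); print s[1]}' |
    awk '{d[$0]++} END {for (j in d) print d[j], j}' |
    sort -n | tail -n 100
}

# Variant 2: one awk process doing all three steps.
single() {
    awk '{split($19, s, "/"); d[s[1]]++} END {for (j in d) print d[j], j}' "$1" |
    sort -n | tail -n 100
}

# bash prints real/user/sys for each whole pipeline on stderr.
time piped "$log" > /dev/null
time single "$log" > /dev/null
```

If `user` + `sys` comes out larger than `real` for the piped variant, its stages overlapped on more than one core. The absolute numbers will differ from the 13 and 21 seconds in the question, since they depend on the machine and the real log.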


Out of curiosity, is this faster? (untested):

          cut -f 19 -d' ' $1 | cut -f1 -d'/' | sort | uniq -c | sort -nr | head -n 100 > $2
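One caveat with the `cut` version: `cut -d' '` treats every single space as a delimiter, while awk's default splitting collapses runs of whitespace, so field 19 may not be the same field if the log contains consecutive spaces. A small sketch with a made-up input line shows the difference:

```shell
line='a  b c'   # hypothetical input with two spaces between "a" and "b"

# awk collapses the run of spaces, so the second field is "b".
awk_field=$(printf '%s\n' "$line" | awk '{print $2}')

# cut sees an empty field between the two spaces.
cut_field=$(printf '%s\n' "$line" | cut -d' ' -f2)

echo "awk: [$awk_field]  cut: [$cut_field]"
# prints: awk: [b]  cut: []
```

If the log is strictly single-space separated, both approaches should select the same field.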
Giacomo answered Dec 13 '25 07:12



