Hi everyone, I am dealing with a log file that has about 5 million lines, so I am using awk on Linux.
I have to extract the domains and get the top 100 in the log, so I wrote this:
awk '{print $19}' $1 |
awk '{ split($0, string, "/");print string[1]}' |
awk '{domains[$0]++} END{for(j in domains) print domains[j], j}' |
sort -n | tail -n 100 > $2
It runs in about 13 seconds.
Then I changed the script to this:
awk '{split($19, string, "/"); domains[string[1]]++}
END{for(j in domains) print domains[j], j}' $1 |
sort -n | tail -n 100 > $2
It runs in about 21 seconds.
Why?
A single awk program should mean less work overall, since it reads each line only once, but the time increased...
If you know the answer, please tell me.
When you pipe commands, they run in parallel: each stage can start working as soon as data arrives in its pipe.
So my guess is that in the first version the work is distributed among your CPU cores, while in the second one all the work is done by a single core.
You can verify this with top (or, better, htop).
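Another quick check that needs no second terminal: time the pipeline and compare user time with real (wall-clock) time. If user plus sys is clearly larger than real, the stages were running on several cores at once. A minimal sketch, with access.log and top100.txt standing in for your $1 and $2:

time (awk '{print $19}' access.log |
      awk -F'/' '{print $1}' |
      awk '{domains[$0]++} END{for(j in domains) print domains[j], j}' |
      sort -n | tail -n 100 > top100.txt)
# bash's time keyword covers the whole subshell, so user/sys include
# every stage of the pipeline; user > real implies parallel execution.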
Out of curiosity, is this faster? (untested):
cut -f 19 -d' ' $1 | cut -f1 -d'/' | sort | uniq -c | sort -nr | head -n 100 > $2
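If sort turns out to be the bottleneck there, a common tweak (again untested on your data) is to force the C locale for the first sort, which compares plain bytes instead of doing locale-aware collation and is often much faster on large inputs:

cut -f 19 -d' ' $1 | cut -f1 -d'/' | LC_ALL=C sort | uniq -c | sort -nr | head -n 100 > $2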