Hi everyone, I am dealing with a log file that has about 5 million lines, so I am using awk on Linux.
I have to extract the domains and get the top 100 in the log, so I wrote this:
awk '{print $19}' $1 |
awk '{ split($0, string, "/");print string[1]}' |
awk '{domains[$0]++} END{for(j in domains) print domains[j], j}' |
sort -n | tail -n 100 > $2
It runs in about 13 seconds.
Then I changed the script to this:
awk '{split($19, string, "/"); domains[string[1]]++}
END{for(j in domains) print domains[j], j}' $1 |
sort -n | tail -n 100 > $2
It runs in about 21 seconds.
Why?
A single awk program should mean less work overall, since it reads each line only once, but the time increased...
If you know the answer, please tell me.
When you pipe commands, they run in parallel: each stage can start working as soon as data arrives in its pipe.
So my guess is that in the first version the work is distributed among your CPU cores, while in the second one all the work is done by a single core.
You can verify this with top (or, better, htop).
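Another quick check that needs no second terminal: time the pipeline and compare user time with real (wall-clock) time. If user plus sys is clearly larger than real, the stages were running on several cores at once. A minimal sketch, with access.log and top100.txt standing in for your $1 and $2:

time (awk '{print $19}' access.log |
      awk -F'/' '{print $1}' |
      awk '{domains[$0]++} END{for(j in domains) print domains[j], j}' |
      sort -n | tail -n 100 > top100.txt)
# bash's time keyword covers the whole subshell, so user/sys include
# every stage of the pipeline; user > real implies parallel execution.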
Out of curiosity, is this faster? (untested):
cut -f 19 -d' ' $1 | cut -f1 -d'/' | sort | uniq -c | sort -nr | head -n 100 > $2
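If sort turns out to be the bottleneck there, a common tweak (again untested on your data) is to force the C locale for the first sort, which compares plain bytes instead of doing locale-aware collation and is often much faster on large inputs:

cut -f 19 -d' ' $1 | cut -f1 -d'/' | LC_ALL=C sort | uniq -c | sort -nr | head -n 100 > $2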