What is the performance difference between gawk and ....? [closed]

Tags: c, awk, perl

This question has been discussed here on Meta, and my answer gives links to a test system to answer this.


The question of whether to use gawk, mawk, C, or some other language for performance reasons comes up often, so let's create a canonical question/answer for a trivial and typical awk program.

The result of this will be an answer that provides a comparison of the performance of different tools performing the basic text processing tasks of regexp matching and field splitting on a simple input file. If tool X is twice as fast as every other tool for this task then that is useful information. If all the tools take about the same amount of time then that is useful information too.

The way this will work: over the next couple of days, people will contribute "answers" containing the programs to be tested. One person (volunteers?) will then test all of them on one platform, or a few people will each test some subset on their own platform so we can compare, and all of the results will be collected into a single answer.

Given a 10-million-line input file created by this script:

$ awk 'BEGIN{for (i=1;i<=10000000;i++) print (i%5?"miss":"hit"),i,"  third\t \tfourth"}' > file

$ wc -l file
10000000 file

$ head -10 file
miss 1   third          fourth
miss 2   third          fourth
miss 3   third          fourth
miss 4   third          fourth
hit 5   third           fourth
miss 6   third          fourth
miss 7   third          fourth
miss 8   third          fourth
miss 9   third          fourth
hit 10   third          fourth

and given this awk script, which prints the 4th, then 1st, then 3rd field of every line that contains "hit" followed by a number ending in 0 (every 10th line of the input):

$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }

Here are the first 5 lines of expected output:

$ awk -f tst.awk file | head -5
fourth hit third
fourth hit third
fourth hit third
fourth hit third
fourth hit third

and here is the result when piped to a 2nd awk script to verify that the main script above is actually functioning exactly as intended:

$ awk -f tst.awk file |
awk '!seen[$0]++{unq++;r=$0} END{print ((unq==1) && (seen[r]==1000000) && (r=="fourth hit third")) ? "PASS" : "FAIL"}'
PASS
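
For a quick sanity check before timing anything on the full file, the same generate/run/verify steps can be run at a smaller scale. This is a minimal sketch, assuming a hypothetical 100-line sample named small_file (not part of the original setup) and using [0-9] in place of [[:digit:]] so it also runs on awks without POSIX character classes:

```shell
# Build a 100-line sample with the same layout as the full input.
# small_file is a hypothetical name chosen for this sketch.
awk 'BEGIN{for (i=1;i<=100;i++) print (i%5?"miss":"hit"),i,"  third\t \tfourth"}' > small_file

# Run the program under test, then apply the same check as above, scaled
# down: lines 10, 20, ..., 100 should match, giving 10 identical lines.
result=$(awk '/hit [0-9]*0 / { print $4, $1, $3 }' small_file |
  awk '!seen[$0]++{unq++;r=$0} END{print ((unq==1) && (seen[r]==10) && (r=="fourth hit third")) ? "PASS" : "FAIL"}')
echo "$result"
```

Swapping the first awk command in the pipeline for any other candidate program gives the same check for that candidate.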

Here are the timing results of the 3rd execution of gawk 4.1.1 running in bash 4.3.33 on cygwin64:

$ time awk -f tst.awk file > /dev/null
real    0m4.711s
user    0m4.555s
sys     0m0.108s

Note that the above is the 3rd execution, to remove caching differences.
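
One way to reproduce a "3rd execution" measurement is to time the same command several times in a row and record only the last report. This is a minimal sketch, assuming bash (for its `time` keyword); `time_runs` is a hypothetical helper name, not something from the original setup:

```shell
# time_runs N CMD [ARGS...]: run CMD N times with stdout discarded,
# timing each run. Earlier runs warm the file cache, so the last
# report is the one to record.
time_runs() {
  n=$1
  shift
  i=1
  while [ "$i" -le "$n" ]; do
    echo "run $i of $n:" >&2
    time "$@" > /dev/null
    i=$((i + 1))
  done
}

# Usage, assuming tst.awk and file from above exist:
#   time_runs 3 awk -f tst.awk file
```

Keeping the loop in a function makes it easy to apply the same number of runs to every candidate program.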

Can anyone provide the equivalent C, perl, python, whatever code to this:

$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }

i.e. find THAT REGEXP on a line (we're not looking for some other solution that works around the need for a regexp), split the line at each series of contiguous white space and print the 4th, then 1st, then 3rd fields separated by a single blank char?

If so we can test them all on one platform to see/record the performance differences.
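
To spell out the three required steps (match the regexp, split on runs of white space, print fields 4, 1 and 3 separated by single blanks), here is a minimal pure-bash sketch. It assumes bash (for `[[ =~ ]]`, arrays and here-strings), and `hit_fields` is a hypothetical name. A shell read loop like this is meant to illustrate the required behavior, not to be a performance contender:

```shell
# Pure-bash illustration of what tst.awk does, reading lines on stdin.
hit_fields() {
  local re='hit [[:digit:]]*0 ' line f
  while IFS= read -r line; do
    [[ $line =~ $re ]] || continue   # same regexp as the awk script
    read -r -a f <<< "$line"         # split on runs of blanks and tabs
    printf '%s %s %s\n' "${f[3]}" "${f[0]}" "${f[2]}"
  done
}

# Usage, assuming file from above exists:
#   hit_fields < file
```

Any contributed program should produce the same output as this for the same input.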


The code contributed so far:

AWK (can be tested against gawk, etc., but mawk, nawk and perhaps others will require [0-9] instead of [[:digit:]])

awk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file

PHP

php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2]."\n";}' < file

shell

egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}'
grep --mmap -E "^hit [[:digit:]]*0 " file | awk '{print $4, $1, $3 }'

Ruby

$ cat tst.rb
File.open("file").readlines.each do |line|
  line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
end
$ ruby tst.rb

Perl

$ cat tst.pl
#!/usr/bin/perl -nl
# A solution much like the Ruby one but with atomic grouping
print "$4 $1 $3" if /^(hit)(?>\s+)(\d*0)(?>\s+)((?>[^\s]+))(?>\s+)(?>([^\s]+))$/
$ perl tst.pl file

Python

none yet

C

none yet
Ed Morton asked Apr 23 '15 14:04


2 Answers

Filtering with egrep before awk gives a large speedup, about 0.75 s total per run versus about 2.5 s for gawk alone:

paul@home ~ % wc -l file
    10000000 file
paul@home ~ % for i in {1..5}; do time egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}' | wc -l ; done
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.63s user 0.02s system 85% cpu 0.759 total
    awk '{print $4, $1, $3}'  0.70s user 0.01s system 93% cpu 0.760 total
    wc -l  0.00s user 0.02s system 2% cpu 0.760 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.65s user 0.01s system 85% cpu 0.770 total
    awk '{print $4, $1, $3}'  0.71s user 0.01s system 93% cpu 0.771 total
    wc -l  0.00s user 0.02s system 2% cpu 0.771 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.64s user 0.02s system 82% cpu 0.806 total
    awk '{print $4, $1, $3}'  0.73s user 0.01s system 91% cpu 0.807 total
    wc -l  0.02s user 0.00s system 2% cpu 0.807 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.63s user 0.02s system 86% cpu 0.745 total
    awk '{print $4, $1, $3}'  0.69s user 0.01s system 92% cpu 0.746 total
    wc -l  0.00s user 0.02s system 2% cpu 0.746 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.62s user 0.02s system 88% cpu 0.727 total
    awk '{print $4, $1, $3}'  0.67s user 0.01s system 93% cpu 0.728 total
    wc -l  0.00s user 0.02s system 2% cpu 0.728 total

versus:

paul@home ~ % for i in {1..5}; do time gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.46s user 0.04s system 97% cpu 2.548 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.43s user 0.03s system 98% cpu 2.508 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.40s user 0.04s system 98% cpu 2.489 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.38s user 0.04s system 98% cpu 2.463 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.39s user 0.03s system 98% cpu 2.465 total

nawk is slower still, at roughly 6 to 7 s per run:

paul@home ~ % for i in {1..5}; do time nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  6.05s user 0.06s system 92% cpu 6.606 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  6.11s user 0.05s system 96% cpu 6.401 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  5.78s user 0.04s system 97% cpu 5.975 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  5.71s user 0.04s system 98% cpu 5.857 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  6.34s user 0.05s system 93% cpu 6.855 total
paulw1128 answered Sep 19 '22 10:09


On OS X Yosemite:

time bash -c 'grep --mmap -E "^hit [[:digit:]]*0 " file | awk '\''{print $4, $1, $3 }'\''' >/dev/null


real    0m5.741s
user    0m6.668s
sys     0m0.112s
Mark Setchell answered Sep 18 '22 10:09