This question has been discussed here on Meta and my answer give links to a test system to answer this.
The question often comes up about whether to use gawk or mawk or C or some other language due to performance so let's create a canonical question/answer for a trivial and typical awk program.
The result of this will be an answer that provides a comparison of the performance of different tools performing the basic text processing tasks of regexp matching and field splitting on a simple input file. If tool X is twice as fast as every other tool for this task then that is useful information. If all the tools take about the same amount of time then that is useful information too.
The way this will work is that over the next couple of days many people will contribute "answers" which are the programs to be tested and then one person (volunteers?) will test all of them on one platform (or a few people will test some subset on their platform so we can compare) and then all of the results will be collected into a single answer.
Given a 10 Million line input file created by this script:
$ awk 'BEGIN{for (i=1;i<=10000000;i++) print (i%5?"miss":"hit"),i," third\t \tfourth"}' > file
$ wc -l file
10000000 file
$ head -10 file
miss 1 third fourth
miss 2 third fourth
miss 3 third fourth
miss 4 third fourth
hit 5 third fourth
miss 6 third fourth
miss 7 third fourth
miss 8 third fourth
miss 9 third fourth
hit 10 third fourth
and given this awk script which prints the 4th then 1st then 3rd field of every line that starts with "hit" followed by an even number:
$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }
Here are the first 5 lines of expected output:
$ awk -f tst.awk file | head -5
fourth hit third
fourth hit third
fourth hit third
fourth hit third
fourth hit third
and here is the result when piped to a 2nd awk script to verify that the main script above is actually functioning exactly as intended:
$ awk -f tst.awk file |
awk '!seen[$0]++{unq++;r=$0} END{print ((unq==1) && (seen[r]==1000000) && (r=="fourth hit third")) ? "PASS" : "FAIL"}'
PASS
Here are the timing results of the 3rd execution of gawk 4.1.1 running in bash 4.3.33 on cygwin64:
$ time awk -f tst.awk file > /dev/null
real 0m4.711s
user 0m4.555s
sys 0m0.108s
Note the above is the 3rd execution to remove caching differences.
Can anyone provide the equivalent C, perl, python, whatever code to this:
$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }
i.e. find THAT REGEXP on a line (we're not looking for some other solution that works around the need for a regexp), split the line at each series of contiguous white space and print the 4th, then 1st, then 3rd fields separated by a single blank char?
If so we can test them all on one platform to see/record the performance differences.
The code contributed so far:
AWK (can be tested against gawk, etc. but mawk, nawk and perhaps others will require [0-9] instead of [:digit:])
awk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file
PHP
php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2];}' < file
shell
egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}'
grep --mmap -E "^hit [[:digit:]]*0 " file | awk '{print $4, $1, $3 }'
Ruby
$ cat tst.rb
File.open("file").readlines.each do |line|
line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
end
$ ruby tst.rb
Perl
$ cat tst.pl
#!/usr/bin/perl -nl
# A solution much like the Ruby one but with atomic grouping
print "$4 $1 $3" if /^(hit)(?>\s+)(\d*0)(?>\s+)((?>[^\s]+))(?>\s+)(?>([^\s]+))$/
$ perl tst.pl file
Python
none yet
C
none yet
Applying egrep before awk gives a great speedup:
paul@home ~ % wc -l file
10000000 file
paul@home ~ % for i in {1..5}; do time egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}' | wc -l ; done
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.63s user 0.02s system 85% cpu 0.759 total
awk '{print $4, $1, $3}' 0.70s user 0.01s system 93% cpu 0.760 total
wc -l 0.00s user 0.02s system 2% cpu 0.760 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.65s user 0.01s system 85% cpu 0.770 total
awk '{print $4, $1, $3}' 0.71s user 0.01s system 93% cpu 0.771 total
wc -l 0.00s user 0.02s system 2% cpu 0.771 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.64s user 0.02s system 82% cpu 0.806 total
awk '{print $4, $1, $3}' 0.73s user 0.01s system 91% cpu 0.807 total
wc -l 0.02s user 0.00s system 2% cpu 0.807 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.63s user 0.02s system 86% cpu 0.745 total
awk '{print $4, $1, $3}' 0.69s user 0.01s system 92% cpu 0.746 total
wc -l 0.00s user 0.02s system 2% cpu 0.746 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.62s user 0.02s system 88% cpu 0.727 total
awk '{print $4, $1, $3}' 0.67s user 0.01s system 93% cpu 0.728 total
wc -l 0.00s user 0.02s system 2% cpu 0.728 total
versus:
paul@home ~ % for i in {1..5}; do time gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.46s user 0.04s system 97% cpu 2.548 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.43s user 0.03s system 98% cpu 2.508 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.40s user 0.04s system 98% cpu 2.489 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.38s user 0.04s system 98% cpu 2.463 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.39s user 0.03s system 98% cpu 2.465 total
'nawk' is even slower!
paul@home ~ % for i in {1..5}; do time nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 6.05s user 0.06s system 92% cpu 6.606 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 6.11s user 0.05s system 96% cpu 6.401 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 5.78s user 0.04s system 97% cpu 5.975 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 5.71s user 0.04s system 98% cpu 5.857 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 6.34s user 0.05s system 93% cpu 6.855 total
On OSX Yosemite
time bash -c 'grep --mmap -E "^hit [[:digit:]]*0 " file | awk '\''{print $4, $1, $3 }'\''' >/dev/null
real 0m5.741s
user 0m6.668s
sys 0m0.112s
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With