I have an 84-million-line XML file that I am processing with gawk on Red Hat Linux. (OK, some people would recommend other tools rather than gawk, but my XML doesn't have multiline tags or any other peculiarities that would make gawk a poor choice for the job.)
My concern is about performance.
My initial AWK script is something like this:
# Test_1.awk
BEGIN {FS = "<|:|=";}
{
    if ($3 == "SubNetwork id")
    {
        # do something
    }
}
END {
    # print something
}
That makes 84 million string comparisons, one per line.
I noticed that "SubNetwork id" only appears when there are 4 fields in the line (NF == 4), so I changed the script to make fewer string comparisons:
# Test_2.awk
BEGIN {FS = "<|:|=";}
{
    if (NF == 4)
    {
        if ($3 == "SubNetwork id")
        {
            # do something
        }
    }
}
END {
    # print something
}
I ran it and saw that I was checking 'NF == 4' 84 million times (obvious) and '$3 == "SubNetwork id"' only 3 million times. Great, I had reduced the number of string comparisons, which I've always thought are more time-consuming than simple integer comparisons (NF is an integer, right?).
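A quick way to reproduce those counts (a sketch; input.xml stands in for the real file):

gawk 'BEGIN {FS = "<|:|=";}
      {nf_checks ++;}                    # every line reaches the NF test
      NF == 4 {str_comps ++;}            # only these reach the string comparison
      END {print "NF checks: " nf_checks; print "string comparisons: " str_comps;}' input.xml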
My surprise came when I tested both scripts for performance. Most of the time, Test_1 was faster than Test_2. I ran them many times to account for other processes that might be using CPU time, but overall my tests were run while the CPU was more or less idle.
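For reference, a simple harness for this kind of comparison (a sketch; input.xml is a placeholder for the real file, and /usr/bin/time is GNU time):

# Run each script a few times and let GNU time report elapsed seconds
for s in Test_1.awk Test_2.awk; do
    for i in 1 2 3; do
        /usr/bin/time -f "$s run $i: %e s" gawk -f "$s" input.xml > /dev/null
    done
done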
My brain tells me that 84 million integer comparisons plus 3 million string comparisons must be faster than 84 million string comparisons, but obviously something is wrong with my reasoning.
My XML looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<ConfigDataFile xmlns:un="specific.xsd" xmlns:xn="generic.xsd">
  <configData dnPrefix="Undefined">
    <xn:SubNetwork id="ROOT_1">
      <xn:SubNetwork id="ROOT_2">
        <xn:attributes>
          ...
        </xn:attributes>
      </xn:SubNetwork>
      <xn:SubNetwork id="ID_1">
        ....
      </xn:SubNetwork>
      <xn:SubNetwork id="ID_2">
        .....
      </xn:SubNetwork>
    </xn:SubNetwork>
  </configData>
</ConfigDataFile>
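In case it helps, this is how that FS splits one of the SubNetwork lines (any leading indentation would end up in $1, but NF stays 4 and $3 stays "SubNetwork id"):

echo '<xn:SubNetwork id="ROOT_1">' |
gawk 'BEGIN {FS = "<|:|=";} {for (i = 1; i <= NF; i ++) printf("$%d = [%s]\n", i, $i);}'

which prints:

$1 = []
$2 = [xn]
$3 = [SubNetwork id]
$4 = ["ROOT_1">]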
Any help to understand this performance problem would be appreciated.
Thanks in advance.
As for speed, using gawk as "plain" awk should make no difference; often, when gawk is installed, awk is just a symlink to gawk, which means they are exactly the same program.
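You can check what awk resolves to on your own machine (the exact path varies by distribution):

ls -l /usr/bin/awk        # often a symlink, e.g. awk -> gawk
awk --version | head -1   # gawk identifies itself as "GNU Awk ..."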
Awk is fast because it has stayed simple and avoided features that are considered necessities in other languages. It concentrates on what it can do well. Several correspondents told me that they appreciated being able to do what they wanted without downloading large modules as they would for other languages.
There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language; the other two are gawk, the GNU implementation, and mawk, a smaller and often faster implementation.
gawk is the GNU implementation of the Awk programming language, first developed for the UNIX operating system in the 1970s. The Awk programming language specializes in dealing with data formatting in text files, particularly text data organized in columns.
I did more tests:
1- Generate a large file with some data
yes 'SomeSampleText SomeOtherText 33 1970 YetAnotherText 777 abc 1 AndSomeMore' | head -12000000 > SomeData.txt
The delimiter is whitespace!
2- Run these 6 tests, several times, and compute the average time for each test. I did it on 3 different machines (running Red Hat Enterprise Linux 4)
# Test 1: string comparison on $5
time gawk 'BEGIN {a = 0;} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData.txt
# Test 2: regex match against $0
time gawk 'BEGIN {a = 0;} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData.txt
# Test 3: regex as a pattern
time gawk 'BEGIN {a = 0;} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData.txt
# Test 4: integer comparison on NF
time gawk 'BEGIN {a = 0;} {if (NF == 9) a ++;} END {print "a: " a;}' SomeData.txt
# Test 5: string comparison on $1
time gawk 'BEGIN {a = 0;} {if ($1 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData.txt
# Test 6: string comparison on $9
time gawk 'BEGIN {a = 0;} {if ($9 == "AndSomeMore") a ++;} END {print "a: " a;}' SomeData.txt
3- I got these results (numbers are seconds)

Test                             Machine 1   Machine 2   Machine 3
1  $5 == "YetAnotherText"            10.35        8.50       12.35
2  $0 ~ /YetAnotherText/             39.39       32.43       13.55
3  /YetAnotherText/ pattern          38.87       31.83       12.90
4  NF == 9                           10.40        9.10       14.40
5  $1 == "SomeSampleText"             7.72        6.54        9.43
6  $9 == "AndSomeMore"               12.26        9.91       14.93
It looks like searching for the pattern /YetAnotherText/ in tests 2 and 3 was very slow. Except on Machine 3...
4- Generate another large file with some data, this time with different delimiters
yes "<SomeSampleText:SomeOtherText=33>1970<YetAnotherText:777=abc>1<AndSomeMore>" | head -12000000 > SomeData2.txt
5- Run 6 tests, changing the FS
# Test 1: string comparison on $5
time gawk 'BEGIN {FS = "<|:|=";} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData2.txt
# Test 2: regex match against $0
time gawk 'BEGIN {FS = "<|:|=";} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData2.txt
# Test 3: regex as a pattern
time gawk 'BEGIN {FS = "<|:|=";} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData2.txt
# Test 4: integer comparison on NF
time gawk 'BEGIN {FS = "<|:|=";} {if (NF == 8) a ++;} END {print "a: " a;}' SomeData2.txt
# Test 5: string comparison on $2
time gawk 'BEGIN {FS = "<|:|=";} {if ($2 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData2.txt
# Test 6: string comparison on $8
time gawk 'BEGIN {FS = "<|:|=";} {if ($8 == "AndSomeMore>") a ++;} END {print "a: " a;}' SomeData2.txt
6- I got these results (I only did it for Machine 3, sorry)

Test                             Machine 3
1  $5 == "YetAnotherText"            66.17
2  $0 ~ /YetAnotherText/             33.11
3  /YetAnotherText/ pattern          32.16
4  NF == 8                           76.77
5  $2 == "SomeSampleText"            37.17
6  $8 == "AndSomeMore>"              77.20
My conclusions (also see the answer below by @user31264):

- With the default whitespace FS, field splitting is cheap, so comparing an early field beats scanning the whole line with a regex (tests 1 and 5 vs. tests 2 and 3).
- With a complex FS like "<|:|=", splitting the line is what dominates: anything that forces a full split (NF, or a late field like $8) becomes the slow part, and a plain regex match on $0 is actually faster (tests 2 and 3 vs. tests 4 and 6).
- That explains my original surprise: checking NF == 4 forces gawk to split the whole line, which costs more than the single string comparison it was meant to avoid.
Below is a simple test. The first line outputs 10,000,000 lines of "a b c d" into the file a; awk is GNU Awk 4.1.3.
[~] yes 'a b c d' | head -10000000 > a
[~] time awk '{if(NF==5)print("a")}' a
2.344u 0.012s 0:02.36 99.5% 0+0k 0+0io 0pf+0w
[~] time awk '{if(NF==5)print("a")}' a
2.364u 0.008s 0:02.37 99.5% 0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.876u 0.024s 0:02.90 99.6% 0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.880u 0.020s 0:02.90 100.0% 0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.540u 0.020s 0:02.56 100.0% 0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.404u 0.004s 0:02.41 99.5% 0+0k 0+0io 0pf+0w
As you can see, checking $1 is faster than checking $4, because in the former case AWK needs to parse the line only up to the first word. If you check only NF, AWK just counts words, which in my case was even faster; but in your case it might be slower to count all the words than to parse the line up to the 3rd field.
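To see the effect in isolation, you can time field access directly (a sketch using the same file a; absolute times will vary by machine):

time awk '{ x = $1 }' a    # parse each line only up to the 1st field
time awk '{ x = $4 }' a    # parse up to the 4th field
time awk '{ x = NF }' a    # find all field boundaries to count them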
Finally, we can speed up AWK like this:
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.376u 0.020s 0:01.40 99.2% 0+0k 0+0io 0pf+0w
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.372u 0.028s 0:01.40 99.2% 0+0k 0+0io 0pf+0w
because /Hahaha/ does not require any parsing of the line into fields.
If you add /SubNetwork id/ before the { in your script, it may speed things up.
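For the original script, that would look something like this (a sketch; the body is whatever Test_1 does):

# Test_3.awk - cheap regex guard before the field comparison
BEGIN {FS = "<|:|=";}
/SubNetwork id/ {
    if ($3 == "SubNetwork id")
    {
        # do something
    }
}
END {
    # print something
}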
If you process only lines with "SubNetwork id" and ignore all the others, you may want to do
grep 'SubNetwork id' your_input_file | awk -f prog.awk
That would speed things up drastically, since grep is much faster than awk.
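One more tweak worth trying: forcing the C locale often makes grep faster still on large ASCII files:

LC_ALL=C grep 'SubNetwork id' your_input_file | awk -f prog.awk

Note that prog.awk then sees only the matching lines, so any logic that depends on the other lines (or on record numbers) has to be adjusted.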
Finally, one more way to speed up awk is to use mawk, which is much faster than gawk. Unfortunately, sometimes it produces different results than gawk, so it should always be tested.
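A quick way to validate mawk against gawk on your script before relying on it:

mawk -f prog.awk input.xml > out.mawk
gawk -f prog.awk input.xml > out.gawk
diff -q out.mawk out.gawk && echo "outputs match"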