I have an 84-million-line XML file that I am processing with gawk on Red Hat Linux. (OK, some people would recommend other tools rather than gawk, but my XML doesn't have multiline tags or any other peculiarities that would make gawk a poor choice for the job.)
My concern is about performance.
My initial AWK script is something like this:
# Test_1.awk
BEGIN {FS = "<|:|=";}
{
    if ($3 == "SubNetwork id")
    {
        # do something
    }
}
END {
    # print something
}
That makes 84 million string comparisons, one per line.
I noticed that "SubNetwork id" only appears when there are 4 fields in the line (NF == 4), so I changed the script to make fewer string comparisons:
# Test_2.awk
BEGIN {FS = "<|:|=";}
{
    if (NF == 4)
    {
        if ($3 == "SubNetwork id")
        {
            # do something
        }
    }
}
END {
    # print something
}
I ran it and saw that I was checking 'NF == 4' 84 million times (obvious) and '$3 == "SubNetwork id"' only 3 million times. Great, I had reduced the number of string comparisons, which I've always thought are more time-consuming than simple integer comparisons (NF is an integer, right?).
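A quick way to reproduce those counts (a sketch; input.xml stands in for the real file):

gawk 'BEGIN {FS = "<|:|=";}
      {nf_checks ++;}                    # every line reaches the NF test
      NF == 4 {str_comps ++;}            # only these reach the string comparison
      END {print "NF checks: " nf_checks; print "string comparisons: " str_comps;}' input.xml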
My surprise came when I tested both scripts for performance. Most of the time, Test_1 was faster than Test_2. I ran them many times to account for other processes that might be using CPU time, but overall my tests were run while the CPU was more or less idle.
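For reference, a simple harness for this kind of comparison (a sketch; input.xml is a placeholder for the real file, and /usr/bin/time is GNU time):

# Run each script a few times and let GNU time report elapsed seconds
for s in Test_1.awk Test_2.awk; do
    for i in 1 2 3; do
        /usr/bin/time -f "$s run $i: %e s" gawk -f "$s" input.xml > /dev/null
    done
done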
My brain tells me that 84 million integer comparisons plus 3 million string comparisons must be faster than 84 million string comparisons, but obviously something is wrong with my reasoning.
My XML looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<ConfigDataFile xmlns:un="specific.xsd" xmlns:xn="generic.xsd">
  <configData dnPrefix="Undefined">
    <xn:SubNetwork id="ROOT_1">
      <xn:SubNetwork id="ROOT_2">
        <xn:attributes>
          ...
        </xn:attributes>
      </xn:SubNetwork>
      <xn:SubNetwork id="ID_1">
        ....
      </xn:SubNetwork>
      <xn:SubNetwork id="ID_2">
        .....
      </xn:SubNetwork>
    </xn:SubNetwork>
  </configData>
</ConfigDataFile>
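In case it helps, this is how that FS splits one of the SubNetwork lines (any leading indentation would end up in $1, but NF stays 4 and $3 stays "SubNetwork id"):

echo '<xn:SubNetwork id="ROOT_1">' |
gawk 'BEGIN {FS = "<|:|=";} {for (i = 1; i <= NF; i ++) printf("$%d = [%s]\n", i, $i);}'

which prints:

$1 = []
$2 = [xn]
$3 = [SubNetwork id]
$4 = ["ROOT_1">]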
Any help to understand this performance problem would be appreciated.
Thanks in advance.
As for speed, using gawk as "plain" awk should make no difference; often, when gawk is installed, awk is just a symlink to gawk, which means they are exactly the same program.
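You can check what awk resolves to on your own machine (the exact path varies by distribution):

ls -l /usr/bin/awk        # often a symlink, e.g. awk -> gawk
awk --version | head -1   # gawk identifies itself as "GNU Awk ..."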
Awk is fast because it has stayed simple and avoided features that are considered necessities in other languages. It concentrates on what it can do well. Several correspondents told me that they appreciated being able to do what they wanted without downloading large modules as they would for other languages.
There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language; the other two are gawk, the GNU implementation, and mawk, a smaller and often faster implementation.
gawk is the GNU implementation of the Awk programming language, first developed for the UNIX operating system in the 1970s. The Awk programming language specializes in dealing with data formatting in text files, particularly text data organized in columns.
I did more tests:
1- Generate a large file with some data
yes 'SomeSampleText SomeOtherText 33 1970 YetAnotherText 777 abc 1 AndSomeMore' | head -12000000 > SomeData.txt
The delimiter is whitespace!
2- Run these 6 tests, several times, and compute the average time for each test. I did it on 3 different machines (running Red Hat Enterprise Linux 4)
# Test 1: string comparison on $5
time gawk 'BEGIN {a = 0;} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData.txt
# Test 2: regex match against $0
time gawk 'BEGIN {a = 0;} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData.txt
# Test 3: regex as a pattern
time gawk 'BEGIN {a = 0;} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData.txt
# Test 4: integer comparison on NF
time gawk 'BEGIN {a = 0;} {if (NF == 9) a ++;} END {print "a: " a;}' SomeData.txt
# Test 5: string comparison on $1
time gawk 'BEGIN {a = 0;} {if ($1 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData.txt
# Test 6: string comparison on $9
time gawk 'BEGIN {a = 0;} {if ($9 == "AndSomeMore") a ++;} END {print "a: " a;}' SomeData.txt
3- I got these results (numbers are seconds)

Test                             Machine 1   Machine 2   Machine 3
1  $5 == "YetAnotherText"            10.35        8.50       12.35
2  $0 ~ /YetAnotherText/             39.39       32.43       13.55
3  /YetAnotherText/ pattern          38.87       31.83       12.90
4  NF == 9                           10.40        9.10       14.40
5  $1 == "SomeSampleText"             7.72        6.54        9.43
6  $9 == "AndSomeMore"               12.26        9.91       14.93
It looks like searching for the pattern /YetAnotherText/ in tests 2 and 3 was very slow. Except on Machine 3...
4- Generate another large file with some data, this time with different delimiters
yes "<SomeSampleText:SomeOtherText=33>1970<YetAnotherText:777=abc>1<AndSomeMore>" | head -12000000 > SomeData2.txt
5- Run 6 tests, changing the FS
# Test 1: string comparison on $5
time gawk 'BEGIN {FS = "<|:|=";} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData2.txt
# Test 2: regex match against $0
time gawk 'BEGIN {FS = "<|:|=";} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData2.txt
# Test 3: regex as a pattern
time gawk 'BEGIN {FS = "<|:|=";} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData2.txt
# Test 4: integer comparison on NF
time gawk 'BEGIN {FS = "<|:|=";} {if (NF == 8) a ++;} END {print "a: " a;}' SomeData2.txt
# Test 5: string comparison on $2
time gawk 'BEGIN {FS = "<|:|=";} {if ($2 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData2.txt
# Test 6: string comparison on $8
time gawk 'BEGIN {FS = "<|:|=";} {if ($8 == "AndSomeMore>") a ++;} END {print "a: " a;}' SomeData2.txt
6- I got these results (I only did it for Machine 3, sorry)

Test                             Machine 3
1  $5 == "YetAnotherText"            66.17
2  $0 ~ /YetAnotherText/             33.11
3  /YetAnotherText/ pattern          32.16
4  NF == 8                           76.77
5  $2 == "SomeSampleText"            37.17
6  $8 == "AndSomeMore>"              77.20
My conclusions (also see the answer below by @user31264):

- With the default whitespace FS, field splitting is cheap, so comparing an early field beats scanning the whole line with a regex (tests 1 and 5 vs. tests 2 and 3).
- With a complex FS like "<|:|=", splitting the line is what dominates: anything that forces a full split (NF, or a late field like $8) becomes the slow part, and a plain regex match on $0 is actually faster (tests 2 and 3 vs. tests 4 and 6).
- That explains my original surprise: checking NF == 4 forces gawk to split the whole line, which costs more than the single string comparison it was meant to avoid.
Below is a simple test. The first line outputs 10,000,000 lines of "a b c d" into the file a; awk is GNU Awk 4.1.3.
[~] yes 'a b c d' | head -10000000 > a
[~] time awk '{if(NF==5)print("a")}' a
2.344u 0.012s 0:02.36 99.5% 0+0k 0+0io 0pf+0w
[~] time awk '{if(NF==5)print("a")}' a
2.364u 0.008s 0:02.37 99.5% 0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.876u 0.024s 0:02.90 99.6% 0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.880u 0.020s 0:02.90 100.0% 0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.540u 0.020s 0:02.56 100.0% 0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.404u 0.004s 0:02.41 99.5% 0+0k 0+0io 0pf+0w
As you can see, checking $1 is faster than checking $4, because in the former case AWK needs to parse the line only up to the first word. If you check only NF, AWK just counts words, which in my case was even faster; but in your case it might be slower to count all the words than to parse the line up to the 3rd field.
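To see the effect in isolation, you can time field access directly (a sketch using the same file a; absolute times will vary by machine):

time awk '{ x = $1 }' a    # parse each line only up to the 1st field
time awk '{ x = $4 }' a    # parse up to the 4th field
time awk '{ x = NF }' a    # find all field boundaries to count them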
Finally, we can speed up AWK like this:
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.376u 0.020s 0:01.40 99.2% 0+0k 0+0io 0pf+0w
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.372u 0.028s 0:01.40 99.2% 0+0k 0+0io 0pf+0w
because /Hahaha/ does not require any parsing of the line into fields.
If you add /SubNetwork id/ before the { in your script, it may speed things up.
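For the original script, that would look something like this (a sketch; the body is whatever Test_1 does):

# Test_3.awk - cheap regex guard before the field comparison
BEGIN {FS = "<|:|=";}
/SubNetwork id/ {
    if ($3 == "SubNetwork id")
    {
        # do something
    }
}
END {
    # print something
}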
If you process only lines with "SubNetwork id" and ignore all the others, you may want to do
grep 'SubNetwork id' your_input_file | awk -f prog.awk
That would speed things up drastically, since grep is much faster than awk.
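One more tweak worth trying: forcing the C locale often makes grep faster still on large ASCII files:

LC_ALL=C grep 'SubNetwork id' your_input_file | awk -f prog.awk

Note that prog.awk then sees only the matching lines, so any logic that depends on the other lines (or on record numbers) has to be adjusted.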
Finally, one more way to speed up awk is to use mawk, which is much faster than gawk. Unfortunately, sometimes it produces different results than gawk, so it should always be tested.
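A quick way to validate mawk against gawk on your script before relying on it:

mawk -f prog.awk input.xml > out.mawk
gawk -f prog.awk input.xml > out.gawk
diff -q out.mawk out.gawk && echo "outputs match"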