AWK/GAWK performance

I have an 84-million-line XML file that I am processing with gawk on Red Hat Linux. (Yes, some people would recommend other tools rather than gawk, but my XML doesn't have multiline tags or any other peculiarities that would make gawk a poor choice for the job.)

My concern is about performance.

My initial AWK script is something like this:

# Test_1.awk
BEGIN {FS = "<|:|=";}
{
if ($3 == "SubNetwork id")
    {
    # do something
    }
}
END {
# print something
}

That makes 84 million string comparisons, one per line.

I noticed that "SubNetwork id" only appears when the line has 4 fields (NF == 4), so I changed the script to make fewer string comparisons:

# Test_2.awk
BEGIN {FS = "<|:|=";}
{
if (NF == 4)
    {
    if ($3 == "SubNetwork id")
        {
        # do something
        }
    }
}
END {
# print something
}

I ran it and saw that I was checking 'NF == 4' 84 million times (obviously) and '$3 == "SubNetwork id"' only 3 million times. Great, I had reduced the number of string comparisons, which I've always thought are more time-consuming than simple integer comparisons (NF is an integer, right?).
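
(For reference, a rough one-pass sketch of how those counts can be confirmed; the counter names and my_file.xml are just placeholders:)

gawk 'BEGIN {FS = "<|:|=";} NF == 4 {nf4 ++; if ($3 == "SubNetwork id") hits ++;} END {print "total: " NR "; NF==4: " nf4 "; matches: " hits;}' my_file.xml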

My surprise came when I tested both scripts for performance. Most of the time, Test_1 was faster than Test_2. I ran them many times to account for other processes that might be using CPU time, but overall my tests were run when the CPU was more or less idle.

My brain tells me that 84 million integer comparisons plus 3 million string comparisons must be faster than 84 million string comparisons, but obviously something is wrong with my reasoning.

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<ConfigDataFile xmlns:un="specific.xsd" xmlns:xn="generic.xsd">
    <configData dnPrefix="Undefined">
        <xn:SubNetwork id="ROOT_1">
            <xn:SubNetwork id="ROOT_2">
                <xn:attributes>
                ...
                </xn:attributes>
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_1">
            ....
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_2">
            .....
            </xn:SubNetwork>
        </xn:SubNetwork>
    </configData>
</ConfigDataFile>

Any help to understand this performance problem would be appreciated.

Thanks in advance.

Mike Duke, asked Apr 20 '17



2 Answers

I did more tests:

1- Generate a large file with some data

yes 'SomeSampleText SomeOtherText 33 1970 YetAnotherText 777 abc 1 AndSomeMore' | head -12000000 > SomeData.txt

The delimiter is whitespace!

2- Run these 6 tests several times and compute the average time for each test (a rough averaging loop is sketched after the commands). I did this on 3 different machines running Red Hat Enterprise Linux 4:

time gawk 'BEGIN {a = 0;} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if (NF == 9) a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($1 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($9 == "AndSomeMore") a ++;} END {print "a: " a;}' SomeData.txt
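
One way to get those averages (a rough sketch, assuming GNU time is installed at /usr/bin/time; the 5-run count and the test shown are arbitrary):

for i in 1 2 3 4 5; do
    /usr/bin/time -f "%e" gawk 'BEGIN {a = 0;} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData.txt
done 2>&1 | gawk '/^[0-9.]+$/ {sum += $1; n ++;} END {print "average: " sum / n " seconds";}'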

3- I got these results (numbers are seconds; one row per test, in the order above)

                                Machine 1   Machine 2   Machine 3
1: $5 == "YetAnotherText"           10.35        8.50       12.35
2: $0 ~ /YetAnotherText/            39.39       32.43       13.55
3: /YetAnotherText/ pattern         38.87       31.83       12.90
4: NF == 9                          10.40        9.10       14.40
5: $1 == "SomeSampleText"            7.72        6.54        9.43
6: $9 == "AndSomeMore"              12.26        9.91       14.93

It looks like searching for the pattern /YetAnotherText/ in tests 2 and 3 was very slow, except on Machine 3...

4- Generate another large file with some data, this time using several different delimiters

yes "<SomeSampleText:SomeOtherText=33>1970<YetAnotherText:777=abc>1<AndSomeMore>" | head -12000000 > SomeData2.txt

5- Run the same 6 tests, changing the FS

time gawk 'BEGIN {FS = "<|:|=";} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if (NF == 8) a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($2 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($8 == "AndSomeMore>") a ++;} END {print "a: " a;}' SomeData2.txt

6- I got these results (Machine 3 only, sorry; numbers are seconds, same test order)

1: $5 == "YetAnotherText"           66.17
2: $0 ~ /YetAnotherText/            33.11
3: /YetAnotherText/ pattern         32.16
4: NF == 8                          76.77
5: $2 == "SomeSampleText"           37.17
6: $8 == "AndSomeMore>"             77.20

My conclusions (also see the comments by @user31264); a short sketch applying them to my original file follows the list:

  • Parsing and splitting a line into fields seems to be faster when there is one simple delimiter instead of several delimiters.
  • Getting $N is usually faster than getting $M when N < M.
  • In some cases, searching for /pattern/ in the whole line is faster than comparing $N == "pattern", especially if N is not one of the first fields of the line.
  • Getting NF can be slow, because the whole line has to be parsed and all fields found, and more so when there are several delimiters.
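
Putting those conclusions together for my original 84-million-line file, a guarded version of Test_1 (just a sketch, not timed here) should let gawk avoid splitting the vast majority of lines into fields at all:

# Test_3.awk
BEGIN {FS = "<|:|=";}
/SubNetwork id/ {
    if ($3 == "SubNetwork id")
        {
        # do something
        }
}
END {
# print something
}
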
Mike Duke, answered Sep 21 '22


Below is a simple test. The first command writes 10,000,000 lines of "a b c d" into the file a; awk is GNU Awk 4.1.3.

[~] yes 'a b c d' | head -10000000 > a
[~] time awk '{if(NF==5)print("a")}' a
2.344u 0.012s 0:02.36 99.5%     0+0k 0+0io 0pf+0w
[~] time awk '{if(NF==5)print("a")}' a
2.364u 0.008s 0:02.37 99.5%     0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.876u 0.024s 0:02.90 99.6%     0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.880u 0.020s 0:02.90 100.0%    0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.540u 0.020s 0:02.56 100.0%    0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.404u 0.004s 0:02.41 99.5%     0+0k 0+0io 0pf+0w

As you can see, checking $1 is faster than checking $4, because in the former case AWK needs to parse the line only up to the first word. If you check only NF, AWK only counts words, which in my case was even faster, but in your case it might be slower to count words than to parse the input line up to the 3rd word.

Finally, we can speed up AWK like this:

[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.376u 0.020s 0:01.40 99.2%     0+0k 0+0io 0pf+0w
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.372u 0.028s 0:01.40 99.2%     0+0k 0+0io 0pf+0w

because /Hahaha/ does not require any parsing.

If you add /SubNetwork id/ before the { in your script, it may speed things up.

If you process only the lines containing "SubNetwork id" and ignore all the others, you may want to do

grep 'SubNetwork id' your_input_file | awk -f prog.awk

It would speed things up drastically, since grep is much faster than awk.

Finally, one more way to speed up awk is to use mawk, which is much faster than gawk. Unfortunately, it sometimes produces different results from gawk, so its output should always be checked.
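
For example, a quick way to do that check (just a sketch; prog.awk and the output file names are placeholders):

time gawk -f prog.awk your_input_file > out_gawk.txt
time mawk -f prog.awk your_input_file > out_mawk.txt
diff out_gawk.txt out_mawk.txt && echo "outputs match"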

user31264, answered Sep 25 '22