 

Most efficient way to join array with separators in awk

In awk scripts, I see several methods used to concatenate an array of strings with a separator: mainly a ternary being used in some way, or the separator being changed on the fly. Occasionally printf/sprintf is used, but I won't consider that here, as I'd be surprised if the function call were not expensive.

For example, given an awk array a of strings, with integer indices from 1 to max, and separator sep:

ternary:

j=""
for ( i=1; i<=max; i++ )
    j = ( j ? j sep a[i] : a[i] )

modify:

j=s=""
for ( i=1; i<=max; i++ ) {
    j = j s a[i]
    s = sep
}

I have just come up with this method, which I don't recall seeing before (ed.: from subsequent comments, this turns out to be similar to a method used by awklib):

mine (for):

j = a[1]
for ( i=1; i<max; ) 
    j = j sep a[++i]

Naively, I would assume this to be more efficient as:

  • there are no tests other than for the loop counter;
  • there are no modifications to the separator; and
  • the redundant final test is elided.

Another option I just thought of:

mine (while):

j=a[i=max]
while (i>1)
    j = a[--i] sep j
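
As a quick sanity check (a hypothetical harness, not part of the benchmark), all four loops build the same string for a non-empty array; one caveat is that the ternary version tests the accumulator's truthiness rather than the index, so it silently collapses a leading run of empty elements:

```shell
out=$(awk 'BEGIN {
    max = 4; sep = ","
    for (i = 1; i <= max; i++) a[i] = "x" i

    # ternary: skips the separator while j is still empty
    j = ""
    for (i = 1; i <= max; i++) j = (j ? j sep a[i] : a[i])
    r1 = j

    # modify: separator is empty on the first pass only
    j = s = ""
    for (i = 1; i <= max; i++) { j = j s a[i]; s = sep }
    r2 = j

    # mine (for): seed with a[1], append the rest
    j = a[1]
    for (i = 1; i < max; ) j = j sep a[++i]
    r3 = j

    # mine (while): seed with a[max], prepend the rest
    j = a[i = max]
    while (i > 1) j = a[--i] sep j
    r4 = j

    if (r1 == r2 && r2 == r3 && r3 == r4) print r1
    else print "MISMATCH"
}')
printf '%s\n' "$out"   # → x1,x2,x3,x4
```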

Running a simple benchmark of gawk and mawk on a mostly idle Ubuntu 20.04 laptop (9 trials each of 1,000,000 iterations of joining a 64-element array of 50-char strings) gave me:

gawk

algorithm      min     max     median  mean
ternary        13.863  15.214  14.148  14.330
modify         11.052  11.926  11.360  11.468
mine (for)      8.612   8.792   8.698   8.710
mine (while)   12.267  13.263  12.591  12.706

mawk

algorithm      min     max     median  mean
ternary        12.406  13.299  12.827  12.806
modify         12.975  13.744  13.085  13.294
mine (for)     11.641  12.397  12.233  12.151
mine (while)    8.400   9.083   8.693   8.693

busybox awk is much less performant. Using 100,000 iterations:

algorithm      min     max     median  mean
ternary        11.004  12.284  11.784  11.605
modify         11.708  12.628  11.879  12.071
mine (for)     10.695  11.642  10.776  11.021
mine (while)    9.244  10.167   9.409   9.549

original-awk is faster than busybox in this situation. Using 250,000 iterations:

algorithm      min     max     median  mean
ternary        12.750  13.525  12.840  13.014
modify         13.105  13.778  13.942  13.572
mine (for)     11.831  12.543  12.710  12.281
mine (while)   10.730  10.854  11.466  11.032

I guessed my second method might be even faster, due to not needing to dereference max in the loop, but intriguingly, although it runs noticeably faster than the other algorithms in most implementations, it runs poorly in gawk, where my first method is the clear winner.

Are there even better methods?

Is there a best idiom for awk join?


My benchmark code was just:

for v in gawk mawk "busybox awk" original-awk; do
    for m in 1 2 3 4; do
        echo ::: $v $m :::
        for n in 1 2 3 4 5 6 7 8 9; do
            time </dev/null $v -f aw$m-${v:0:1}
        done
        echo
    done
done

aw3-b: (others are similar)

BEGIN {
    a[ 1] = "a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]"
    a[ 2] = "a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]"
    # ... more lines ...
    a[63] = "a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]"
    a[64] = "a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]"
    max=64
    sep=FS

    for(n=1; n<=100000; n++) {
        j=a[1]
        for (i=1;i<max;)
            j = j sep a[++i]
    }
}
asked Oct 19 '25 by jhnc


2 Answers

j = a[1]
for ( i=2; i<=max; i++ ) 
    j = j sep a[i]

is the clearest and most common approach, so IMHO it's what should be used, even if a tiny performance improvement could be squeezed out of a slightly less clear script.

The place you typically see a ternary used is when printing the contents of the array:

for ( i=1; i<=max; i++ ) 
    printf "%s%s", a[i], (i<max ? OFS : ORS)

instead of the following alternative, which is slightly more efficient but lengthier, and requires the printf format string and associated values to be duplicated: once before the loop for the first index, and again inside the loop for all of the other indices:

printf "%s", a[1]
for ( i=2; i<=max; i++ ) 
    printf "%s%s", OFS, a[i]
print ""
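
As a quick check (a minimal sketch, assuming default OFS of a space and default ORS of a newline), both styles emit the same line for the same array:

```shell
# Style 1: ternary picks OFS between elements, ORS after the last one.
a=$(awk 'BEGIN { n = 3; a[1] = "x"; a[2] = "y"; a[3] = "z"
    for (i = 1; i <= n; i++) printf "%s%s", a[i], (i < n ? OFS : ORS) }')

# Style 2: print the first element, then OFS-prefix the rest.
b=$(awk 'BEGIN { n = 3; a[1] = "x"; a[2] = "y"; a[3] = "z"
    printf "%s", a[1]
    for (i = 2; i <= n; i++) printf "%s%s", OFS, a[i]
    print "" }')

[ "$a" = "$b" ] && echo "identical: $a"   # → identical: x y z
```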

Just yesterday I was surprised by a huge performance improvement using a recursive descent function instead of a loop, see https://stackoverflow.com/a/68389244/1745001, so I thought I'd give that a try but it failed miserably:

$ head -50 tst*.awk
==> tstLoop.awk <==
BEGIN {
    max = 64
    for (i=1; i<=max; i++) {
        a[i] = sprintf("%50s",i)
    }
    for (n=1; n<=100000; n++) {
        j = a[1]
        for (i=2; i<=max; i++) {
            j = j OFS a[i]
        }
    }
}

==> tstRec.awk <==
function join(i){
    if( i == 1 ) return a[i]
    return join(i-1) OFS a[i]
}

BEGIN {
    max = 64
    for (i=1; i<=max; i++) {
        a[i] = sprintf("%50s",i)
    }
    for (n=1; n<=100000; n++) {
        j = join(max)
    }
}

$ time awk -f tstLoop.awk

real    0m2.596s
user    0m2.561s
sys     0m0.030s

$ time awk -f tstRec.awk

real    0m3.743s
user    0m3.733s
sys     0m0.000s
answered Oct 22 '25 by Ed Morton


I do not know whether its overall performance is in any way better than the for loop, but it is cleaner for regular record-processing use cases, where the "array" is implicitly given by the input records:

awk 'FNR==1 {printf $1} FNR>1 {printf " "$1} END {printf "\n"}'
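
For example, joining one field per input line (a small demo using the safer `printf "%s", $1` form, since a bare `printf $1` would misbehave if a field contained a `%`):

```shell
# Stream-join the first field of each line with spaces, no array needed.
out=$(printf 'a\nb\nc\n' |
    awk 'FNR==1 {printf "%s", $1} FNR>1 {printf " %s", $1} END {printf "\n"}')
printf '%s\n' "$out"   # → a b c
```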

I understand that this is not what the author really asked for, but it is likely what people finding this post are actually looking for.

answered Oct 22 '25 by user3504575