 

Most efficient way to join array with separators in awk

In awk scripts, I see several methods used to concatenate an array of strings with a separator: mainly a ternary being used in some way, or the separator being changed on the fly. Occasionally printf/sprintf is used, but I won't consider that here, as I'd be surprised if the function call were not expensive.

For example, given an awk array a of strings, with integer indices from 1 to max, and separator sep:

ternary:

j=""
for ( i=1; i<=max; i++ )
    j = ( j ? j sep a[i] : a[i] )

modify:

j=s=""
for ( i=1; i<=max; i++ ) {
    j = j s a[i]
    s = sep
}

I have just come up with this method, which I don't recall seeing before (ed.: from subsequent comments, this turns out to be similar to a method used by awklib):

mine (for):

j = a[1]
for ( i=1; i<max; ) 
    j = j sep a[++i]

Naively, I would assume this to be more efficient as:

  • there are no tests other than for the loop counter;
  • there are no modifications to the separator; and
  • the redundant final test is elided.

Another option I just thought of:

mine (while):

j=a[i=max]
while (i>1)
    j = a[--i] sep j
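
As a quick sanity check (a hypothetical harness, not part of the benchmark), all four loops build the same string for a non-empty array; one caveat is that the ternary version tests the accumulator's truthiness rather than the index, so it silently collapses a leading run of empty elements:

```shell
out=$(awk 'BEGIN {
    max = 4; sep = ","
    for (i = 1; i <= max; i++) a[i] = "x" i

    # ternary: skips the separator while j is still empty
    j = ""
    for (i = 1; i <= max; i++) j = (j ? j sep a[i] : a[i])
    r1 = j

    # modify: separator is empty on the first pass only
    j = s = ""
    for (i = 1; i <= max; i++) { j = j s a[i]; s = sep }
    r2 = j

    # mine (for): seed with a[1], append the rest
    j = a[1]
    for (i = 1; i < max; ) j = j sep a[++i]
    r3 = j

    # mine (while): seed with a[max], prepend the rest
    j = a[i = max]
    while (i > 1) j = a[--i] sep j
    r4 = j

    if (r1 == r2 && r2 == r3 && r3 == r4) print r1
    else print "MISMATCH"
}')
printf '%s\n' "$out"   # → x1,x2,x3,x4
```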

Running a simple benchmark of gawk and mawk on a mostly idle Ubuntu 20.04 laptop (9 trials each of 1,000,000 iterations of joining a 64-element array of 50-char strings) gave me:

gawk

algorithm      min     max     median  mean
ternary        13.863  15.214  14.148  14.330
modify         11.052  11.926  11.360  11.468
mine (for)      8.612   8.792   8.698   8.710
mine (while)   12.267  13.263  12.591  12.706

mawk

algorithm      min     max     median  mean
ternary        12.406  13.299  12.827  12.806
modify         12.975  13.744  13.085  13.294
mine (for)     11.641  12.397  12.233  12.151
mine (while)    8.400   9.083   8.693   8.693

busybox awk is much less performant. Using 100,000 iterations:

algorithm      min     max     median  mean
ternary        11.004  12.284  11.784  11.605
modify         11.708  12.628  11.879  12.071
mine (for)     10.695  11.642  10.776  11.021
mine (while)    9.244  10.167   9.409   9.549

original-awk is faster than busybox in this situation. Using 250,000 iterations:

algorithm      min     max     median  mean
ternary        12.750  13.525  12.840  13.014
modify         13.105  13.778  13.942  13.572
mine (for)     11.831  12.543  12.710  12.281
mine (while)   10.730  10.854  11.466  11.032

I guessed my second method might be even faster, due to not needing to dereference max in the loop, but intriguingly, although it runs noticeably faster than the other algorithms in most implementations, it runs poorly in gawk, where my first method is the clear winner.

Are there even better methods?

Is there a best idiom for awk join?


My benchmark code was just:

for v in gawk mawk "busybox awk" original-awk; do
    for m in 1 2 3 4; do
        echo ::: $v $m :::
        for n in 1 2 3 4 5 6 7 8 9; do
            time </dev/null $v -f aw$m-${v:0:1}
        done
        echo
    done
done

aw3-b: (others are similar)

BEGIN {
    a[ 1] = "a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]"
    a[ 2] = "a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]"
    # ... more lines ...
    a[63] = "a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]"
    a[64] = "a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]"
    max=64
    sep=FS

    for(n=1; n<=100000; n++) {
        j=a[1]
        for (i=1;i<max;)
            j = j sep a[++i]
    }
}
asked Oct 19 '25 by jhnc


2 Answers

j = a[1]
for ( i=2; i<=max; i++ ) 
    j = j sep a[i]

is the clearest and most common approach, so IMHO it's what should be used, even if a tiny performance improvement could be squeezed out of a slightly less clear script.

The place you typically see a ternary used is when printing the contents of the array:

for ( i=1; i<=max; i++ ) 
    printf "%s%s", a[i], (i<max ? OFS : ORS)

instead of the following alternative, which is slightly more efficient but lengthier, and requires the printf format string and associated values to be duplicated: once before the loop for the first index, and again inside the loop for all of the other indices:

printf "%s", a[1]
for ( i=2; i<=max; i++ ) 
    printf "%s%s", OFS, a[i]
print ""
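
As a quick check (a minimal sketch, assuming default OFS of a space and default ORS of a newline), both styles emit the same line for the same array:

```shell
# Style 1: ternary picks OFS between elements, ORS after the last one.
a=$(awk 'BEGIN { n = 3; a[1] = "x"; a[2] = "y"; a[3] = "z"
    for (i = 1; i <= n; i++) printf "%s%s", a[i], (i < n ? OFS : ORS) }')

# Style 2: print the first element, then OFS-prefix the rest.
b=$(awk 'BEGIN { n = 3; a[1] = "x"; a[2] = "y"; a[3] = "z"
    printf "%s", a[1]
    for (i = 2; i <= n; i++) printf "%s%s", OFS, a[i]
    print "" }')

[ "$a" = "$b" ] && echo "identical: $a"   # → identical: x y z
```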

Just yesterday I was surprised by a huge performance improvement using a recursive descent function instead of a loop, see https://stackoverflow.com/a/68389244/1745001, so I thought I'd give that a try but it failed miserably:

$ head -50 tst*.awk
==> tstLoop.awk <==
BEGIN {
    max = 64
    for (i=1; i<=max; i++) {
        a[i] = sprintf("%50s",i)
    }
    for (n=1; n<=100000; n++) {
        j = a[1]
        for (i=2; i<=max; i++) {
            j = j OFS a[i]
        }
    }
}

==> tstRec.awk <==
function join(i){
    if( i == 1 ) return a[i]
    return join(i-1) OFS a[i]
}

BEGIN {
    max = 64
    for (i=1; i<=max; i++) {
        a[i] = sprintf("%50s",i)
    }
    for (n=1; n<=100000; n++) {
        j = join(max)
    }
}

$ time awk -f tstLoop.awk

real    0m2.596s
user    0m2.561s
sys     0m0.030s

$ time awk -f tstRec.awk

real    0m3.743s
user    0m3.733s
sys     0m0.000s
answered Oct 22 '25 by Ed Morton


I do not know whether its overall performance is in any way better than the for loop, but it is cleaner for regular record-processing use cases, where the "array" is implicitly given by the input records:

awk 'FNR==1 {printf $1} FNR>1 {printf " "$1} END {printf "\n"}'
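
For example, joining one field per input line (a small demo using the safer `printf "%s", $1` form, since a bare `printf $1` would misbehave if a field contained a `%`):

```shell
# Stream-join the first field of each line with spaces, no array needed.
out=$(printf 'a\nb\nc\n' |
    awk 'FNR==1 {printf "%s", $1} FNR>1 {printf " %s", $1} END {printf "\n"}')
printf '%s\n' "$out"   # → a b c
```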

I understand that this is not what the author really asked for, but it is likely what people finding this post are actually looking for.

answered Oct 22 '25 by user3504575