Characterise Awk regex engine

Tags: regex, awk

I'm trying to implement a robust ere_parenthesize function, which requires accurately parsing the bracket expressions of a user-provided ERE.

The difficult part is that support for character classes [: :], equivalence classes [= =] and collating symbols [. .] in bracket expressions differs between Awk implementations, yet it is critical for determining where a bracket expression terminates.
A simple example: when Awk doesn't support [: :], /[[:punct:]]/ is equivalent to /[:[punct]]/, because the bracket expression then ends at the first ].
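
To make the termination difference concrete, here is a minimal sketch of a bracket-expression scanner. The shell wrapper `bracket_end`, the Awk function `scan` and the `classes` flag are hypothetical names, not part of the code in this question. It assumes that a backslash inside brackets escapes the next character, and it only locates the closing `]`; it does not validate class names or reproduce any engine's termination bugs.

```sh
# Hypothetical helper: print the 1-based index of the "]" closing the bracket
# expression that opens at position $2 of the ERE $1, or 0 if unterminated.
# $3 is 1 when the target Awk honours [: :], [= =] and [. .], else 0.
bracket_end() {
    awk -v start="$2" -v classes="$3" '
    function scan(ere, i, has_classes,    j, c, d, k) {
        j = i + 1
        if (substr(ere, j, 1) == "^") j++    # negation marker
        if (substr(ere, j, 1) == "]") j++    # a leading "]" is literal
        while ((c = substr(ere, j, 1)) != "") {
            if (c == "]") return j
            d = substr(ere, j + 1, 1)
            if (c == "\\") {
                j += 2                       # assumption: backslash escapes the next char
            } else if (has_classes && c == "[" && (d == ":" || d == "=" || d == ".")) {
                k = index(substr(ere, j + 2), d "]")
                if (k == 0) return 0         # class opened but never closed
                j += k + 3                   # resume after the closing ":]", "=]" or ".]"
            } else {
                j++
            }
        }
        return 0
    }
    BEGIN { print scan(ARGV[1], start + 0, classes + 0); exit }
    ' "$1"
}

bracket_end '[[:punct:]]' 1 1    # prints 11: "[:punct:]" is consumed as one class
bracket_end '[[:punct:]]' 1 0    # prints 10: the first "]" ends the expression
```

The ERE is passed through ARGV rather than -v so that Awk does not interpret backslash sequences in it before the scanner sees them.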


I brainstormed a few runtime checks; they are not enough to fully characterize a regex engine (given the constraint that they must not trigger a fatal error). Still, I ran them with multiple Awks/OSs and made a few assumptions in light of the results:
note: I'm still looking for implementations that would invalidate assumption #1

  1. An implementation that supports [= =] but doesn't support standard backslash-escape sequences within it always has the termination bug found here:

    match("]", /[[=x=]?]/) == 0 (support for equivalence classes)
    match("]", /[[=\t=]?]/) == 1 (no support for standard backslash-escape sequences within [= =])

    implies:

    match("]", /[[=x]?]/) == 1 (termination bug)

  2. An implementation that supports [= =] and standard backslash-escape sequences within it does not have termination bugs:

    match("]", /[[=x=]?]/) == 0 (support for equivalence classes)
    match("\t", /[[=\t=]?]/) == 1 (support for standard backslash-escape sequences within [= =])

    implies:

    match("]", /[[=\t]]/) (crash)

  3. An implementation that supports [: :] but doesn't support [= =] always has termination bugs:

    match("1", /[[:xdigit:]]/) == 1 (support for character classes)
    match("]", /[[=x=]?]/) == 1 (no support for equivalence classes)

    implies:

    match("]", /[[:xdigit]?]/) == 1 (termination bug)
    match("]", /[[:abc:]?]/) == 1 (termination bug)
    match("]", /[[::]?]/) == 1 (termination bug)
    match("]", /[[:]?]/) == 1 (termination bug)


My question is about confirming/invalidating the above assumptions; could you provide the results of running the following code with the Awks/OSs that you have at hand?

BEGIN {
    ere_brackets_have_character_classes    =  match("1", /[[:xdigit:]]/)
    ere_brackets_have_equivalence_classes  = !match("]", /[[=x=]?]/)
    ere_brackets_have_backslash_escape_bug =  match("]", /[[=\t=]?]/)

    print "ere_brackets_have_character_classes    :", ere_brackets_have_character_classes
    print "ere_brackets_have_equivalence_classes  :", ere_brackets_have_equivalence_classes
    print "ere_brackets_have_backslash_escape_bug :", ere_brackets_have_backslash_escape_bug

    if (ere_brackets_have_equivalence_classes) {
        if (ere_brackets_have_backslash_escape_bug) {
            print "Assumption #1: expected output: 1"
            r = "[[=x]?]"
            print match("]", r)
        } else {
            print "Assumption #2: expected output: crash"
            r = "[[=\\t]]"
            match("]", r)
        }
    } else if (ere_brackets_have_character_classes) {
        print "Assumption #3: expected output: 1"
        split("[[:xdigit]?] [[:abc:]?] [[::]?] [[:]?]", a, " ")
        print match("]", a[1]) && \
              match("]", a[2]) && \
              match("]", a[3]) && \
              match("]", a[4])
    } else {
        print "no expected output: nothing"
    }
}

note: Some Awks compile EREs before running the code when they are given as string constants or within / /; as a workaround, I stored them in variables.
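
Because a malformed dynamic ERE aborts the whole Awk process (as assumption #2 expects), one way to observe such a crash without losing the rest of a test run is to compile each candidate ERE in a child process and check the exit status. A minimal sketch, assuming a POSIX shell; the helper name `ere_compiles` and the `AWK` variable are hypothetical:

```sh
# Hypothetical helper: succeeds if the Awk named by $AWK (default: awk) can
# compile the dynamic ERE given as $1, fails if the interpreter aborts.
ere_compiles() {
    "${AWK:-awk}" 'BEGIN { r = ARGV[1]; match("", r); exit 0 }' "$1" >/dev/null 2>&1
}

# The pattern from assumption #2; the outcome depends on the implementation.
if ere_compiles '[[=\t]]'; then
    echo "compiles"
else
    echo "aborts"
fi
```

Setting `AWK` to another interpreter (for instance `AWK=mawk`) would let the same probe be repeated per implementation.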


ASIDE

match("1", /[[:xdigit:]]/) should be locale-independent, am I right?

Fravadona asked Mar 02 '26 00:03


1 Answer

FreeBSD 10.3-RELEASE-p7:

awk version 20121220 (FreeBSD)

ere_brackets_have_character_classes    : 1
ere_brackets_have_equivalence_classes  : 0
ere_brackets_have_backslash_escape_bug : 1
Assumption #3: expected output: 1
1

Ubuntu 22.04.4:

busybox → 1:1.30.1-7ubuntu3

ere_brackets_have_character_classes    : 1
ere_brackets_have_equivalence_classes  : 1
ere_brackets_have_backslash_escape_bug : 0
Assumption #2: expected output: crash
awk: bad regex '[[=\t]]': Unmatched [, [^, [:, [., or [=

original-awk (aka "nawk", "bwk awk", and "one true awk") → 2018-08-27-1

ere_brackets_have_character_classes    : 1
ere_brackets_have_equivalence_classes  : 0
ere_brackets_have_backslash_escape_bug : 1
Assumption #3: expected output: 1
1

mawk → 1.3.4.20200120-3

ere_brackets_have_character_classes    : 1
ere_brackets_have_equivalence_classes  : 0
ere_brackets_have_backslash_escape_bug : 1
Assumption #3: expected output: 1
mawk: run time error: regular expression compile failed (bad class -- [], [^] or [)
[[:xdigit]?]
    FILENAME="" FNR=0 NR=0

Debian 8.11:

mawk → 1.3.3-17

ere_brackets_have_character_classes    : 0
ere_brackets_have_equivalence_classes  : 0
ere_brackets_have_backslash_escape_bug : 1
no expected output: nothing
jhnc answered Mar 03 '26 17:03


