I'm trying to implement a robust ere_parenthesize function, which requires accurately parsing the bracket expressions of a user-provided ERE.
The difficult part is that support for character classes [: :], equivalence classes [= =] and collating symbols [. .] in bracket expressions differs between Awk implementations, yet it is critical for determining where a bracket expression ends.
A simple example would be that /[[:punct:]]/ is equivalent to /[:[punct]]/ when Awk doesn't support [: :].
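To make the dependency concrete, here is a minimal sketch of the scanning step. The names bracket_end and scan and the CLASSES flag are hypothetical, not taken from my actual ere_parenthesize. It models only the two clean cases (classes fully honoured, or not at all), not the termination bugs discussed below, and it assumes that a backslash inside a bracket expression escapes the next character.

```sh
# Hypothetical helper, not part of the original post.
# Usage: bracket_end ERE CLASSES
#   ERE     an ERE that starts with a bracket expression
#   CLASSES 1 if the target Awk is assumed to honour [: :], [= =] and [. .], else 0
# Prints the index of the closing "]", or 0 if none is found.
bracket_end() {
    awk '
    function scan(ere, classes,    i, n, c, d, j) {
        n = length(ere)
        i = 2                                  # skip the opening "["
        if (substr(ere, i, 1) == "^") i++      # optional negation
        if (substr(ere, i, 1) == "]") i++      # a leading "]" is a literal
        while (i <= n) {
            c = substr(ere, i, 1)
            if (c == "]") return i
            if (c == "\\") { i += 2; continue }   # assumed: backslash escapes the next character
            d = substr(ere, i + 1, 1)
            if (classes && c == "[" && (d == ":" || d == "=" || d == ".")) {
                j = index(substr(ere, i + 2), d "]")
                if (j == 0) return 0           # unterminated [: [= or [.
                i += j + 3                     # jump past the closing ":]" "=]" or ".]"
                continue
            }
            i++
        }
        return 0
    }
    BEGIN { print scan(ARGV[1], ARGV[2] + 0) }
    ' "$1" "$2"
}

bracket_end '[[:punct:]]' 1    # classes honoured: prints 11
bracket_end '[[:punct:]]' 0    # classes not honoured: prints 10
```

With CLASSES set to 0 the expression ends at the first "]", which is the /[:[punct]]/ reading above. The ERE is passed through ARGV rather than -v so that its backslashes are not interpreted as escape sequences.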
I brainstormed a few runtime checks that are not enough to fully characterize a regex engine (given the constraint that they shall not trigger a fatal error). Still, I ran them with multiple Awks/OSs and made a few assumptions in light of the results:
note: I'm still looking for implementations that would invalidate assumption #1
Assumption #1: An implementation that supports [= =] but doesn't support standard backslash-escape sequences within it always has the termination bug found here:
match("]", /[[=x=]?]/) == 0 (support for equivalence classes)
match("]", /[[=\t=]?]/) == 1 (no support for standard backslash-escape sequences within [= =])
implies:
match("]", /[[=x]?]/) == 1 (termination bug)
Assumption #2: An implementation that supports [= =] and standard backslash-escape sequences within it does not have termination bugs:
match("]", /[[=x=]?]/) == 0 (support for equivalence classes)
match("\t", /[[=\t=]?]/) == 1 (support for standard backslash-escape sequences within [= =])
implies:
match("]", /[[=\t]]/) (crash)
Assumption #3: An implementation that supports [: :] but doesn't support [= =] always has termination bugs:
match("1", /[[:xdigit:]]/) == 1 (support for character classes)
match("]", /[[=x=]?]/) == 1 (no support for equivalence classes)
implies:
match("]", /[[:xdigit]?]/) == 1 (termination bug)
match("]", /[[:abc:]?]/) == 1 (termination bug)
match("]", /[[::]?]/) == 1 (termination bug)
match("]", /[[:]?]/) == 1 (termination bug)
My question is about confirming/invalidating the above assumptions; could you provide the results of running the following code with the Awks/OSs that you have at hand?
BEGIN {
    ere_brackets_have_character_classes = match("1", /[[:xdigit:]]/)
    ere_brackets_have_equivalence_classes = !match("]", /[[=x=]?]/)
    ere_brackets_have_backslash_escape_bug = match("]", /[[=\t=]?]/)
    print "ere_brackets_have_character_classes :", ere_brackets_have_character_classes
    print "ere_brackets_have_equivalence_classes :", ere_brackets_have_equivalence_classes
    print "ere_brackets_have_backslash_escape_bug :", ere_brackets_have_backslash_escape_bug
    if (ere_brackets_have_equivalence_classes) {
        if (ere_brackets_have_backslash_escape_bug) {
            print "Assumption #1: expected output: 1"
            r = "[[=x]?]"
            print match("]", r)
        } else {
            print "Assumption #2: expected output: crash"
            r = "[[=\\t]]"
            match("]", r)
        }
    } else if (ere_brackets_have_character_classes) {
        print "Assumption #3: expected output: 1"
        split("[[:xdigit]?] [[:abc:]?] [[::]?] [[:]?]", a, " ")
        print match("]", a[1]) && \
              match("]", a[2]) && \
              match("]", a[3]) && \
              match("]", a[4])
    } else {
        print "no expected output: nothing"
    }
}
note: Some Awks compile the EREs before running the code when they are provided as string constants or within / /; as a workaround I stored them in variables.
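The workaround can be checked in isolation with a small probe. This is only a sketch: lazy_compile_probe is a hypothetical name, and it assumes that an ERE held in a variable is compiled only when the match() call is actually executed, which I believe holds for the Awks I tried but have not verified everywhere.

```sh
# Hypothetical probe, not part of the original test script.
lazy_compile_probe() {
    awk 'BEGIN {
        r = "[[=\\t]]"            # same pattern as in the Assumption #2 branch
        if (0) match("]", r)      # never reached, so r is never used as an ERE
        print "still running"
    }'
}

lazy_compile_probe
```

If an implementation aborted here before printing, it would mean the variable workaround is not enough for that Awk.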
match("1", /[[:xdigit:]]/) should be locale independent, am I right?
FreeBSD 10.3-RELEASE-p7:
awk version 20121220 (FreeBSD)
ere_brackets_have_character_classes : 1
ere_brackets_have_equivalence_classes : 0
ere_brackets_have_backslash_escape_bug : 1
Assumption #3: expected output: 1
1
Ubuntu 22.04.4:
busybox → 1:1.30.1-7ubuntu3
ere_brackets_have_character_classes : 1
ere_brackets_have_equivalence_classes : 1
ere_brackets_have_backslash_escape_bug : 0
Assumption #2: expected output: crash
awk: bad regex '[[=\t]]': Unmatched [, [^, [:, [., or [=
original-awk (aka "nawk", "bwk awk", and "one true awk") → 2018-08-27-1
ere_brackets_have_character_classes : 1
ere_brackets_have_equivalence_classes : 0
ere_brackets_have_backslash_escape_bug : 1
Assumption #3: expected output: 1
1
mawk → 1.3.4.20200120-3
ere_brackets_have_character_classes : 1
ere_brackets_have_equivalence_classes : 0
ere_brackets_have_backslash_escape_bug : 1
Assumption #3: expected output: 1
mawk: run time error: regular expression compile failed (bad class -- [], [^] or [)
[[:xdigit]?]
FILENAME="" FNR=0 NR=0
Debian 8.11:
mawk → 1.3.3-17
ere_brackets_have_character_classes : 0
ere_brackets_have_equivalence_classes : 0
ere_brackets_have_backslash_escape_bug : 1
no expected output: nothing