I have reduced your problem to this:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
This happens because of the /i (case-insensitive) modifier combined with character sequences such as "ss" or "st" that can be replaced by a single typographic ligature character, which makes the lookbehind variable length. For instance, /August/i matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06).
However, if we remove the /i (case-insensitive) modifier, it works, because typographic ligatures are then not matched.
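A minimal sketch of the length mismatch described above, assuming Perl 5.14 or later. The helper name describe is made up for this illustration, and the ligature is written as an escape so the script does not depend on the file encoding.

```perl
use strict;
use warnings;

# Two spellings that /August/i treats as equal.
my $plain    = 'AUGUST';          # six characters
my $ligature = "augu\x{FB06}";    # five characters, the last one is U+FB06

# Hypothetical helper: length plus match results with and without /i.
sub describe {
    my ($s) = @_;
    return sprintf '%d chars, with /i: %s, without /i: %s',
        length($s),
        ($s =~ /August/i ? 'match' : 'no match'),
        ($s =~ /August/  ? 'match' : 'no match');
}

print describe($plain),    "\n";
print describe($ligature), "\n";
```

Both strings match under /i even though their lengths differ, which is why a lookbehind containing "st" has no single fixed width there.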
Solution: use the aa modifier, i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
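As a rough check of what /iaa changes, here is a sketch that runs the same regex against a few made-up sample strings. The helper allowed and the samples are assumptions for illustration only, not part of the original question.

```perl
use strict;
use warnings;

# The regex from the answer; under /iaa every alternative in the
# lookbehind is exactly seven characters wide.
my $re = qr/(?<!(Mon|Fri|Sun)day |August )abcd/iaa;

# Hypothetical helper: 1 if "abcd" occurs without a forbidden prefix.
sub allowed {
    my ($s) = @_;
    return $s =~ $re ? 1 : 0;
}

my @samples = (
    [ 'plain weekday',     'Tuesday abcd' ],
    [ 'forbidden weekday', 'monday ABCD' ],
    [ 'forbidden month',   'AUGUST abcd' ],
    [ 'month with U+FB06', "Augu\x{FB06} abcd" ],
);

for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf "%-18s %s\n", $label, allowed($string) ? 'true' : 'false';
}
```

The last sample is the trade-off: with /aa the ligature is no longer treated as "st", so the lookbehind does not reject that spelling.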
From perlre:
To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
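The "k" versus KELVIN SIGN case from the quote can be tried directly. This is only a sketch: the helper matches_k and its second parameter are invented for the example, and the sign is written as the escape \x{212A}.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";    # KELVIN SIGN, which case-folds to "k"

# Hypothetical helper: does the whole string match "k" case-insensitively?
# A true second argument selects /iaa instead of plain /i.
sub matches_k {
    my ($s, $ascii_safe) = @_;
    return $ascii_safe
        ? ($s =~ /\Ak\z/iaa ? 1 : 0)
        : ($s =~ /\Ak\z/i   ? 1 : 0);
}

printf "KELVIN SIGN under /i:   %d\n", matches_k($kelvin, 0);
printf "KELVIN SIGN under /iaa: %d\n", matches_k($kelvin, 1);
printf "ASCII K under /iaa:     %d\n", matches_k('K', 1);
```

Plain /i accepts the non-ASCII sign for an ASCII "k", while /iaa refuses that intermixing but keeps ordinary ASCII case-insensitivity.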
See a closely related discussion here
That's because st can be a ligature. The same happens to fi and ff:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
my $fi = 'ﬁ';    # U+FB01 LATIN SMALL LIGATURE FI, a single character
print $fi =~ /fi/i;
So imagine something like fi|ﬁ where, indeed, the lengths of the alternatives aren't the same.
st could be represented by a 1-character stylistic ligature, ﬆ (U+FB06) or ﬅ (U+FB05), so its length could be 2 or 1.
Quickly finding perl's full list of 2→1-character ligatures using a bash command:
$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done
ff fi fl ss st
These respectively represent the ﬀ, ﬁ, ﬂ, ß, and ﬆ/ﬅ ligatures.
(ﬅ represents ſt, using the obsolete long s character; it matches st and it does not match ft.)
Perl also supports the remaining stylistic ligatures, ﬃ and ﬄ for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ff and fi/fl separately.
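A different way to look at the same ligatures, without relying on the lookbehind error, is to ask Perl for their full case fold. This sketch assumes Perl 5.16 or later for the built-in fc; the helper fold_of and the choice of code points (ß plus U+FB00 through U+FB06) are mine, not something the shell loop above produces.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);    # fc() needs Perl 5.16 or later

# Sharp s plus the seven Latin ligatures in U+FB00..U+FB06.
my @code_points = (0xDF, 0xFB00 .. 0xFB06);

# Hypothetical helper: the full case fold of a single code point.
sub fold_of {
    my ($cp) = @_;
    return fc(chr $cp);
}

for my $cp (@code_points) {
    my $fold = fold_of($cp);
    printf "U+%04X folds to \"%s\" (%d characters)\n", $cp, $fold, length($fold);
}
```

Each of these single characters folds to two or three ASCII letters, which is the length mismatch that /i has to account for.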
Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ĳ for ij or the obsolete Spanish ꝇ for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.
Perl 5.16.3 (and similarly old versions) only stumbles on ss (for ß) and fails to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.
Perl 5.14 introduced ligature support, so earlier versions don't have this problem.
Workarounds for /(?<!August)x/i (only the first will properly avoid Auguﬆ):
/(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
/(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
/(?<!(?aa)August)x/i (the whole lookbehind is "ASCII-safe" ²)
/(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
/(?<!Augus[t])x/i (breaks ligature seeking ¹)
/(?<!Augus.)x/i (slightly different, matches more)
/(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)
These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.
The first variation (which is the only comprehensive one) matches the variable widths with two lookbehinds: the first for the six-character version (no ligatures, as noted in the first quote below) and the second for any ligatures, employing a lookahead (which has zero width!) for st (including the ligatures) and then accounting for its single-character width with a . (dot).
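To see the two-lookbehind variation next to the whole-regex /iaa variation, here is a small sketch. The helper finds_x and the sample strings are assumptions made for this comparison, and the behavior was reasoned through rather than checked on every Perl release.

```perl
use strict;
use warnings;

# First workaround from the list: one lookbehind per possible width.
my $comprehensive = qr/(?<!Augus[t])(?<!Augu(?=st).)x/i;
# For contrast: the whole regex made "ASCII-safe".
my $ascii_safe    = qr/(?<!August)x/iaa;

# Hypothetical helper: 1 if $re finds an acceptable "x" in $s, else 0.
sub finds_x {
    my ($re, $s) = @_;
    return $s =~ $re ? 1 : 0;
}

my @samples = (
    [ 'Junex',           'Junex' ],
    [ 'Augustx',         'Augustx' ],
    [ 'AUGUSTx',         'AUGUSTx' ],
    [ 'Augu U+FB06 x',   "Augu\x{FB06}x" ],
);

for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf "%-14s comprehensive=%d ascii_safe=%d\n",
        $label, finds_x($comprehensive, $string), finds_x($ascii_safe, $string);
}
```

The two patterns should agree on the plain spellings and differ only on the ligature spelling, which the /iaa version lets through.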
Two segments of the perlre man page:
/i & ligatures
There are a number of Unicode characters that match a sequence of multiple characters under /i. For example, "LATIN SMALL LIGATURE FI" should match the sequence fi. Perl is not currently able to do this when the multiple characters are in the pattern and are split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;        # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;  # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;       # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;    # Doesn't match!
/aa (perl 5.14+)
To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}), specify the a twice, for example /aai or /aia. (The first occurrence of a restricts the \d, etc., and the second occurrence adds the /i restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
To summarize, this modifier provides protection for applications that don't wish to be exposed to all of Unicode. Specifying it twice gives added protection.