When I do:
use strict; use warnings;
my $regex = qr/[[:upper:]]/;
my $line = MyModule::get_my_line_from_external_source(); #file, db, etc...
print "upper here\n" if( $line =~ $regex );
How does Perl know when to match only ASCII uppercase and when to match Unicode (UTF-8) uppercase?
It is a precompiled regex, so Perl must somehow know what counts as uppercase. Does that depend on locale settings? If so, how do I match UTF-8 uppercase in the "C" locale with a precompiled regex?
Updated based on tchrist's comments:
use strict; use warnings; use Encode;
my $regex = qr/[[:upper:]]/;
my $line = XXX::line();
print "$line: upper1 ", ($line =~ $regex) ? "YES" : "NO", "\n";
my $uline = Encode::decode_utf8($line);
print "$uline: upper2 ", ($uline =~ $regex) ? "YES" : "NO", "\n";
package XXX;
sub line { return "alpha-Ω"; } #returning octets - not utf8 chars
The output is:
alpha-Ω: upper1 NO
alpha-Ω: upper2 YES
Does this mean that the precompiled regex is not 'hard-precompiled' but 'soft-precompiled', so that Perl interprets '[[:upper:]]' based on the UTF8 flag of the matched $line?
Before Perl 5.14, this was not very well defined.
With 5.14, the pattern knows how it was compiled, and you have the /u, /l, /d, /a, and /aa pattern modifiers. You can also say
use re "/u";
or
use re "/msu";
to turn all those flags on in the lexical scope.
For example, under 5.14:
% perl -le 'print qr/foo/'
(?^:foo)
% perl -E 'say qr/foo/'
(?^u:foo)
% perl -E 'say qr/foo/l'
(?^l:foo)
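To make the effect of those modifiers on [[:upper:]] concrete, here is a minimal sketch, assuming Perl 5.14 or later. The helper matches() is a hypothetical name introduced only for this illustration, and the test strings are written as escapes so the script does not depend on the source file's encoding.

```perl
use strict; use warnings;

# Hypothetical helper (not from the original post): report a match as YES/NO.
sub matches {
    my ($re, $str) = @_;
    return $str =~ $re ? "YES" : "NO";
}

my $omega  = "\x{3A9}";   # GREEK CAPITAL LETTER OMEGA, a decoded character
my $agrave = "\xC0";      # LATIN CAPITAL LETTER A WITH GRAVE, stored as a single byte

# The modifier is fixed when the pattern is compiled.
my $re_u = qr/[[:upper:]]/u;   # Unicode rules
my $re_a = qr/[[:upper:]]/a;   # POSIX classes restricted to ASCII
my $re_d = qr/[[:upper:]]/d;   # pre-5.14 behaviour: depends on the string

print "Omega   /u: ", matches($re_u, $omega),  "\n";
print "Omega   /a: ", matches($re_a, $omega),  "\n";
print "A-grave /u: ", matches($re_u, $agrave), "\n";
print "A-grave /d: ", matches($re_d, $agrave), "\n";

{
    # Lexically scoped default: every pattern compiled in this block gets /u.
    use re '/u';
    my $re = qr/[[:upper:]]/;
    print "A-grave with use re '/u': ", matches($re, $agrave), "\n";
}
```

The /u and use re '/u' patterns treat both characters as uppercase, /a rejects the omega, and /d rejects the byte-string A-grave because it falls back to native rules for a string that is not stored as UTF-8.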
I would steer clear of locales; just use all-Unicode.
BTW, I would make darned sure that that "external source" gave you back a string that was properly decoded; that is, has its UTF8 flag turned on. Character functions work poorly on encoded strings, because they really want decoded strings instead.
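A minimal sketch of decoding at the boundary, assuming the external source hands back UTF-8 octets. The function get_line_octets() is a hypothetical stand-in for that source, and strict decoding with FB_CROAK is one possible choice, not the only one.

```perl
use strict; use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the external source: returns raw UTF-8 octets,
# here "alpha-" followed by U+03A9 encoded as the two bytes CE A9.
sub get_line_octets { return "alpha-\xCE\xA9" }

my $regex = qr/[[:upper:]]/;

my $octets = get_line_octets();

# Decode once, as soon as the data enters the program.
# FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets intact.
my $chars = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "octets: length %d, upper %s\n",
    length($octets), ($octets =~ $regex ? "YES" : "NO");
printf "chars:  length %d, upper %s\n",
    length($chars),  ($chars  =~ $regex ? "YES" : "NO");
```

The octet string has length 8 and no uppercase match, while the decoded string has length 7 and matches, which mirrors the NO/YES output in the question.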