Perl precompiled regex - utf8

When I do:

use strict; use warnings;
my $regex = qr/[[:upper:]]/;
my $line = MyModule::get_my_line_from_external_source(); #file, db, etc...
print "upper here\n" if( $line =~ $regex );

How does Perl know when it must match only ASCII uppercase and when it must match UTF-8 uppercase? The regex is precompiled, so Perl must somehow already know what counts as uppercase. Does it depend on the locale settings? If so, how do I match UTF-8 uppercase in the "C" locale with a precompiled regex?

Update, based on tchrist's comments:

use strict; use warnings; use Encode;
my $regex = qr/[[:upper:]]/;

my $line = XXX::line();
print "$line: upper1 ", ($line =~ $regex) ? "YES" : "NO", "\n";

my $uline = Encode::decode_utf8($line);
print "$uline: upper2 ", ($uline =~ $regex) ? "YES" : "NO", "\n";

package XXX;
sub line { return "alpha-Ω"; } #returning octets - not utf8 chars

The output is:

alpha-Ω: upper1 NO
alpha-Ω: upper2 YES

Does this mean that the precompiled regex is not 'hard-precompiled' but only 'soft-precompiled', so that Perl interprets '[[:upper:]]' according to the UTF8 flag of the string being matched ($line)?

kobame asked May 20 '11 12:05


1 Answer

Before Perl 5.14, this was not very well defined.

With 5.14, the pattern knows how it was compiled, and you have the /u, /l, /d, /a, and /aa pattern modifiers. You can also say

use re "/u";

or

use re "/msu";

to turn all those flags on in the lexical scope.
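As a minimal sketch of that lexical scoping (assuming Perl 5.14 or later, and deliberately not using `use v5.14`, since that would enable the unicode_strings feature and change the default), the same qr// source compiles differently inside and outside the block. The variable names are only illustrative:

```perl
use strict;
use warnings;

my $outside = qr/[[:upper:]]/;    # compiled with the default rules

my $inside;
{
    use re '/u';                  # in effect only until the closing brace
    $inside = qr/[[:upper:]]/;    # compiled as if written qr/[[:upper:]]/u
}

my $after = qr/[[:upper:]]/;      # the pragma no longer applies here

# A compiled pattern stringifies with the flags it was compiled under.
print "outside: $outside\n";      # (?^:[[:upper:]])
print "inside:  $inside\n";       # (?^u:[[:upper:]])
print "after:   $after\n";        # (?^:[[:upper:]])
```

The flag travels with the compiled pattern, so $inside keeps its Unicode rules even when it is used outside the block where it was compiled.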

For example, under 5.14:

% perl -le 'print qr/foo/'
(?^:foo)
% perl -E 'say qr/foo/'
(?^u:foo)
% perl -E 'say qr/foo/l'
(?^l:foo)
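The flag shown in the stringified form is what decides how [[:upper:]] behaves. Here is a small sketch, assuming Perl 5.14 or later with no locale or feature pragmas in effect, that compares /d, /u, and /a on two test characters. The two sample strings and the %upper hash are illustrative choices, not anything from the question:

```perl
use strict;
use warnings;

my $e_acute = "\xC9";       # U+00C9, stored as one native byte (UTF8 flag off)
my $omega   = "\x{3A9}";    # U+03A9 is above 255, so it is stored with the UTF8 flag on

# The same character class, precompiled under three character-set modifiers.
my %upper = (
    d => qr/[[:upper:]]/d,  # old default: rules depend on the target string
    u => qr/[[:upper:]]/u,  # always Unicode rules
    a => qr/[[:upper:]]/a,  # POSIX classes match ASCII only
);

for my $mod (qw(d u a)) {
    printf "/%s  U+00C9: %-3s  U+03A9: %s\n",
        $mod,
        ($e_acute =~ $upper{$mod} ? 'yes' : 'no'),
        ($omega   =~ $upper{$mod} ? 'yes' : 'no');
}
```

Under /d only the string with the UTF8 flag on matches (no, yes), which is the "soft" behaviour the question observed. Under /u both match, and under /a neither does.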

I would steer clear of locales; just use all-Unicode.

BTW, I would make darned sure that that "external source" gave you back a string that was properly decoded; that is, has its UTF8 flag turned on. Character functions work poorly on encoded strings, because they really want decoded strings instead.
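To illustrate why decoding matters, here is a rough sketch that assumes the external source hands back UTF-8 octets for "alpha-Ω" (the hard-coded $octets string is a hypothetical stand-in for that source). Even with /u, matching against the undecoded octets finds the wrong character, because each byte is treated as a separate code point:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $upper = qr/([[:upper:]])/u;         # capture the first uppercase character

# Hypothetical input: the raw UTF-8 bytes of "alpha-Ω" (Ω is 0xCE 0xA9).
my $octets = "alpha-\xCE\xA9";
my $chars  = decode('UTF-8', $octets);  # decode once, at the input boundary

for my $item ([octets => $octets], [decoded => $chars]) {
    my ($label, $string) = @$item;
    my ($hit) = $string =~ $upper;      # list context returns the capture
    printf "%-8s length %d, first uppercase: %s\n",
        $label, length($string),
        defined $hit ? sprintf('U+%04X', ord $hit) : 'none';
}
```

The octet string has length 8 and "matches" on U+00CE, the first byte of the encoded Ω read as a Latin-1 capital letter. The decoded string has length 7 and matches on U+03A9, the character that was actually intended.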

tchrist answered Sep 20 '22 09:09