
How could I catch a "Unicode non-character" warning?

How could I catch the "Unicode non-character 0xffff is illegal for interchange"-warning?

#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;

use warnings FATAL => qw(all);

my $character;

try {
    $character = "\x{ffff}";
} catch {
    die "---------- caught error ----------\n";
};

say "something";

Output:

# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.
sid_com asked Feb 26 '11 15:02


2 Answers

A Perl 5.10.0 ⋯ 5.13.8 Bug

I’m going to assume that you don’t actually want to “catch” this warning, but rather to survive or ignore it. If you really want to catch it, well, there may be easier ways to do that.

But the first thing to know is that there is no such thing as an illegal code point, only code points not valid for interchange.

You just have to use a no warnings "utf8" for the scope of where you need to use the full Unicode range (or more). There is no need to use an eval for this. All it takes is a scoped warning suppression. Even that is unnecessary on newer perls.

So instead of this:

$char = chr(0xFFFE);

write (on older perls):

$char = do { no warnings "utf8"; chr(0xFFFE) };

This is also the situation with pattern matches involving such a character:

$did_match = do { no warnings "utf8"; $char =~ $char };

Without that guard, the match will cause a warning or a fatal error, depending on how old your perl is, or nothing at all if your perl is new enough.

You can disable utf8-related warnings only on releases where it matters this way:

no if $^V < 5.13.9, qw<warnings utf8>;

‘Fixed in the Next Release’

The really interesting thing is that they (read: Perl5 Porters, and in particular, Karl Williamson) have fixed the bug that requires a no warnings "utf8" guard just to work with any code point at all. It is only the output where you may have to be careful. Watch:

% perl5.10.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode character 0xfffe is illegal at -e line 1.

% perl5.11.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.8 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.9 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

% perl5.13.10 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

The safest thing to do is put no warnings "utf8" in just the places you need it. But there is no need for an eval!

As of 5.13.10, and hence in 5.14, there are three subcategories of utf8 warnings: surrogate for UTF‑16, nonchar as described below, and non_unicode for supers, also defined below.
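
A minimal sketch of how those subcategories can be used, assuming perl 5.14 or later (the three names do not exist before 5.13.10). The in-memory output handle, the $SIG{__WARN__} collector, and the variable names are scaffolding invented for this illustration, not part of the original answer:

```perl
use strict;
use warnings;

# Collect warnings instead of letting them reach STDERR.
my @caught;
$SIG{__WARN__} = sub { push @caught, $_[0] };

# An in-memory handle with a :utf8 layer stands in for real output.
open(my $out, '>:utf8', \my $buffer) or die "cannot open in-memory handle: $!";

my $nonchar   = chr(0xFFFE);   # a noncharacter
my $surrogate = chr(0xD800);   # a surrogate

# Default warnings: writing the noncharacter should be reported.
print {$out} $nonchar;
my $default_count = @caught;

@caught = ();
{
    # Only the nonchar subcategory is silenced in this block.
    no warnings 'nonchar';
    print {$out} $nonchar;      # nonchar warnings are off here
    print {$out} $surrogate;    # still covered by the surrogate subcategory
}
my @scoped = @caught;

close $out;
print "default: $default_count warning(s), scoped: ", scalar(@scoped), "\n";
```

Silencing one subcategory at a time keeps the other output checks active, which is narrower than a blanket no warnings "utf8".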

An All‐Perl Interchange is Safe

You probably don’t want to suppress the “illegal for interchange” warnings on output, though, because there the warning is accurate: those code points really are not valid for interchange. Well, unless you’re using Perl’s "utf8" encoding, which isn’t the same as its "UTF‑8" encoding, oddly enough. The "utf8" encoding is laxer than the formal standard, because it allows us to do more interesting things than we otherwise could.
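
To make the "utf8" versus "UTF‑8" distinction concrete, here is a small sketch using the core Encode module. It assumes a perl new enough that chr() accepts code points above the Unicode range; the code point chosen and the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# The two spellings resolve to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# A code point just past the Unicode range (a "super").
my $super = do { no warnings 'utf8'; chr(0x11_0000) };

my ($strict_bytes, $lax_bytes);
{
    no warnings 'utf8';
    $strict_bytes = encode('UTF-8', $super);   # strict encoding
    $lax_bytes    = encode('utf8',  $super);   # Perl's laxer encoding
}

printf "%-6s => %-12s %d byte(s)\n", 'UTF-8', $strict_name, length $strict_bytes;
printf "%-6s => %-12s %d byte(s)\n", 'utf8',  $lax_name,    length $lax_bytes;
```

Comparing the two byte strings shows whether the strict encoding kept the out-of-range code point or substituted something else for it.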

However, if and only if you have a 100% pure-perl datapath, you can still use any code point you want, including non-unicode code points up to ᴍᴀxɪɴᴛ. That’s 0x7FFF_FFFF on 32‑bit machines, and something unspeakably huge on 64‑bit machines: 0xFFFF_FFFF_FFFF_FFFF! That’s not just a super; it’s a hypermega!

% perl -Mwarnings -CS -E 'my $a = chr(0xFFFF_FFFF); say $a ' | 
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Code point 0xFFFFFFFF is not Unicode, may not be portable at -e line 1.
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF); say $a' |
 perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF_FFFF_FFFF); say $a' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Hexadecimal number > 0xffffffff non-portable at -e line 1.
got ord 18446744073709551615

% perl -Mwarnings -CS -E 'no warnings qw[ utf8 portable ]; my $a = chr(0xFFFF_FFFF_FFFF_FFFF);  say $a ' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 18446744073709551615

Note that on a 32‑bit machine, that last one produces this:

Integer overflow in hexadecimal number at -e line 1.
got ord 4294967295

Varieties of Noncharacters Illegal for Interchange

There are several — quite a few, actually — different classes of code points that are not legal for interchange.

  • Any code point such that (ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE is true. This covers the last two code points in every plane. Since Unicode spans 17 planes, it therefore defines 34 such code points. Those are not characters, although they are Unicode code points. Let’s call these the Penults. They fall under the nonchar warning class on 5.13.10 or better.

  • The 32 code points starting at U+FDD0. These are guaranteed to be Noncharacters, although of course they are still Unicode code points. Like the previous penult set, these too fall under the nonchar warning class on 5.13.10 or better.

  • The 1024 high surrogates and the 1024 low surrogates, which were carved out as slop to make UTF‑16 possible for all those dumb systems that tried UCS‑2 instead of UTF‑8 or UTF‑32. This cripples the range of valid Unicode code points, restricting them to only the first 21 bits worth. SURROGATES ARE STILL CODE POINTS. They just are not valid for interchange, because they cannot always be correctly represented by brain-dead-clever UTF‑16. Under 5.13.10 or better, these are controlled by the surrogate warning subclass.

  • Beyond that, we’re now above the Unicode range. I’ll call these Supers. On a 32‑bit machine, you still have (10 or) 11 bits of them beyond the standard 21 bits that Unicode gives you. Perl can use these just fine. That gives 2**32 total code points you can use in your Perl program (well, or 2**31 at least, due to signed overflow). You get a million Unicode code points, but then you get a couple of billion Super code points beyond those that you can use in Perl. If you are running 5.13.10 or better, you can control access to these via the non_unicode warnings subclass.

  • Perl still follows the rules about Penults even up in the Super range. There are 480 such Superpenults on a 32‑bit machine, and rather more of them on a 64‑bit one.

  • If you really want to play it nonportably, then if you have native 64‑bit ints, you have another 32 or 33 bits above what the supers give you. You now have 18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand and 616 characters. You have a whole exabyte of distinct code points! That’s so far beyond super that I’m going to call them Hypermegas. Ok, so these aren’t very portable, since they require a truly 64‑bit platform. They’re a bit foreign, so maybe we should write that Ὑπέρμεγας to scare people away. :) Note that the rules against penults still apply to hypermegas.
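
The classes in the list above can be sketched as plain integer arithmetic. The classify() function and its labels are hypothetical names made up for this illustration, and the sketch stops at 32‑bit values, so it does not separate hypermegas from supers:

```perl
use strict;
use warnings;

# Classify a numeric code point using the categories from the list above.
sub classify {
    my ($cp) = @_;
    my $is_penult = ($cp & 0xFFFE) == 0xFFFE;    # last two code points of a plane
    if ($cp > 0x10_FFFF) {
        return $is_penult ? 'super penult' : 'super';
    }
    return 'penult'         if $is_penult;
    return 'noncharacter'   if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 'high surrogate' if $cp >= 0xD800 && $cp <= 0xDBFF;
    return 'low surrogate'  if $cp >= 0xDC00 && $cp <= 0xDFFF;
    return 'ordinary';
}

# Count the penults across the 17 Unicode planes (0x00 .. 0x10).
my $penult_count = grep { classify($_) eq 'penult' }
                   map  { ($_ << 16) | 0xFFFE, ($_ << 16) | 0xFFFF } 0x00 .. 0x10;

printf "U+%04X is a %s\n", $_, classify($_)
    for 0x41, 0xFFFE, 0xFDD0, 0xD800, 0xDC00, 0x11_0000;
print "penults in Unicode: $penult_count\n";
```

Nothing here touches chr() or output, so it runs without any warning guards on any perl.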


The Test Program

I wrote a little program that proves that these code points are cool.

testing Penults             passed all 34 codepoints
testing Super_penults       passed all 480 codepoints
testing Noncharacters       passed all 32 codepoints
testing Low_surrogates      passed all 1024 codepoints
testing High_surrogates     passed all 1024 codepoints
testing Supers              passed all 8 codepoints
testing Ὑπέρμεγας            passed all 10 codepoints

NOTE: That last line above shows Yet Another Stupid Bug in SO’s infernal highlighting code. Notice the last WɪᴋɪWᴏʀᴅ up there, the \p{Greek} one, got left out of the colorization scheme? That means they are only looking for capitalized ASCII identifiers. Très passé! Why bother accepting ᴜɴɪᴄᴏᴅᴇ if you aren’t going to use things like \p{Uppercase} correctly? As you’ll see in my program where I have a @ὑπέρμεγας array, us ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇs handle this perfectly fine. ☺

I obviously didn’t run all the supers or the hypers. And on a 32‑bit machine, you’ll only get 4 of the tested hypers. I also didn’t test any of the hyperpenults.

Here’s the testing program, which runs cleanly on all versions from 5.10 and up.

#!/usr/bin/env perl
#
# hypertest - show how to safely use code points not legal for interchange in Perl
# 
# Tom Christiansen
# [email protected]
# Sat Feb 26 16:38:44 MST 2011

use utf8;
use 5.10.0;
use strict;
use if $] > 5.010, "autodie";
use warnings FATAL => "all";

use Carp;

binmode(STDOUT, ":utf8");
END { close STDOUT }

$\ = "\n";

sub ghex(_);

my @penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x00 .. 0x10;

my @super_penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x10 .. 0xFF;

my @low_surrogates  = map { 0xDC00 + $_ } 0x000 .. 0x3FF;
my @high_surrogates = map { 0xD800 + $_ } 0x000 .. 0x3FF;

my @noncharacters = map { 0xFDD0 + $_ } 0x00 .. 0x1F;

my @supers = ( 
    0x0011_0000,  0x0100_0000,  0x1000_0000,  0x1F00_0000,  
    0x1FFF_FFFF,  0x3FFF_FFFF,  0x7FFF_FFFF,  0x7FFF_FFFF,  
);

# these should always work anywhere 
my @ὑπέρμεγας = ( 
    0x8000_0000,   0xF000_0000,   
    0x3FFF_FFFF,   0xFFFF_FFFF,  
);

####
# now we go fishing for 64-bit ὑπέρμεγας
####

eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => ( 
        0x01_0000_0000, 
        0x01_FFFF_FF00,
    );
};
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x0001_0000_0000_0000,
        0x001F_0000_0000_0000,
        0x7FFF_FFFF_FFFF_FFFF,
        0xFFFF_FFFF_FFFF_FFFF,
    );
};

# more than 64??
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x01_0001_0000_0000_0000,
        0x01_7FFF_FFFF_FFFF_FFFF,
        0x01_FFFF_FFFF_FFFF_FFFF,
    );
    1;
};


my @testpairs = (
    penults         => \@penults,
    super_penults   => \@super_penults,
    noncharacters   => \@noncharacters ,
    low_surrogates  => \@low_surrogates,
    high_surrogates => \@high_surrogates,
    supers          => \@supers,
    ὑπέρμεγας       => \@ὑπέρμεγας,   
);

while (my($name, $aref) = splice(@testpairs, 0, 2)) {
    printf "testing %-20s", ucfirst $name;

    my(@passed, @failed);

    for my $codepoint (@$aref) {

        use warnings FATAL => "all";

        my $char = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            chr(0xFFFF) && chr($codepoint);
        };

        my $regex_ok = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            $char =~ $char;
            1;
        };

        my $status = defined($char) && $regex_ok;

        push @{ $status ? \@passed : \@failed }, $codepoint;
    }

    my $total  = @$aref;
    my $passed = @passed;
    my $failed = @failed;

    given($total) {
        when ($passed)  { print "passed all $total codepoints" }
        when ($failed)  { print "failed all $total codepoints" }
        default         {
            print "of $total codepoints, failed $failed and passed $passed";
            my $flist = join(", ", map { ghex } @failed);
            my $plist = join(", ", map { ghex } @passed);
            print "\tpassed: $plist";
            print "\tfailed: $flist";
        }
    }

}

sub ghex(_) {
    my $num = shift();
    my $hex = sprintf("%X", $num);
    return $hex if length($hex) < 5;
    my $flip = reverse $hex;
    $flip =~ s<
        ( \p{ahex} \p{ahex} \p{ahex} \p{ahex} )
        (?= \p{ahex} )
        (?! \p{ahex}* \. )
    ><${1}_>gx;
    return "0x" . reverse($flip);
}
tchrist answered Nov 04 '22 05:11


The warning is raised at compile time, and FATAL warnings turn it into a compile-time error, similar to forgetting to close a regex. If you delay the compilation of that piece to runtime, you can catch it:

#!/usr/bin/env perl
use warnings;

use warnings FATAL => qw(all);

my $character;

eval q{
    $character = "\x{ffff}";
};
if ($@) {
    die "---------- caught error ----------\n";
}

print "something\n";

Output:

---------- caught error ----------

If you remove the q after eval, you'll get the same behavior as your script does now, since eval {...}; if($@) {...} is the same as try {...} catch {...};, but with the q it's an eval of a string, which is totally different.
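
The difference between the two forms of eval can be seen in isolation with a short sketch. A deliberate syntax error stands in here for the fatal compile-time warning, since it behaves the same way on every perl version; the variable names are made up for the illustration:

```perl
use strict;
use warnings;

# String eval: the code is compiled when eval runs, so a compile-time
# failure (here a deliberate syntax error) ends up in $@.
my $string_ok    = eval q{ my $value = ; 1 };
my $string_error = $@;

# Block eval: the block was compiled along with the rest of the file, so it
# can only trap failures that happen while it runs.
my $block_ok    = eval { die "runtime failure\n"; 1 };
my $block_error = $@;

print "string eval trapped: $string_error";
print "block eval trapped:  $block_error";
```

Had the syntax error been written inside the block form, the whole file would have failed to compile before either eval had a chance to run.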

UPDATE:
As Tom points out, you should probably just disable that warning with no warnings qw(utf8) in a narrow scope around the spot you're setting or getting those kinds of values. You may still want to catch utf8 warnings as errors on output (or anything else that sends the data outside your program):

#!/usr/bin/env perl
use warnings FATAL => qw(all);

my $character;

eval {
    no warnings qw(utf8);
    $character = "\x{ffff}";
};
if ($@) {
    die "---------- caught error  ----------\n";
}

print "something\n";
eval {
    print "something $character else\n";
};
if ($@) {
    die "---------- caught output error  ----------\n";
}

Output:

something
---------- caught output error  ----------
freiheit answered Nov 04 '22 06:11