The following script is encoded in UTF-8:
use utf8;
$fuer = pack('H*', '66c3bc72');
$fuer =~ s/ü/!!!/;
print $fuer;
The ü in the s/// is stored in the script as c3 bc, as the following xxd hex dump shows.
0000000: 75 73 65 20 75 74 66 38 3b 0a 0a 24 66 75 65 72 use utf8;..$fuer
0000010: 20 3d 20 70 61 63 6b 28 27 48 2a 27 2c 20 27 36 = pack('H*', '6
0000020: 36 63 33 62 63 37 32 27 29 3b 0a 0a 24 66 75 65 6c3bc72');..$fue
0000030: 72 20 3d 7e 20 73 2f c3 bc 2f 21 21 21 2f 3b 0a r =~ s/../!!!/;.
0000040: 0a 70 72 69 6e 74 20 24 66 75 65 72 3b 0a .print $fuer;.
c3 bc is the UTF-8 representation of ü.
Since the script is encoded in UTF-8 and I am using use utf8, I expected the script to replace the ü of für in the variable $fuer, yet it doesn't.
It does, however, if I remove the use utf8. This runs against what I thought use utf8 was for: to indicate that the script is encoded in UTF-8.
0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.
In general usage there is no difference between "utf8" and "utf-8"; they are simply two names for UTF-8, the most common Unicode encoding. (Perl's Encode module is an exception: there, utf8 names a lax variant and UTF-8 the strict one.)
UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken languages (music notation, mathematical symbols, APL). The stated objective of the Unicode Consortium is to encompass all communications.
The problem is with character boundaries: you are comparing an encoded string of bytes with a decoded character string.
$fuer = pack('H*', '66c3bc72') creates the four-byte string "\x66\xc3\xbc\x72", whereas a small u with diaeresis, ü, is the single character "\xfc" once decoded, so the two don't match.
If you used decode_utf8 from the Encode module to further process your variable $fuer, it would decode the UTF-8 to form the three-character string "\x66\xfc\x72", and the substitution would then work.
use utf8 applies the equivalent of decode_utf8 to the whole source file, so without it your ü appears encoded as "\xc3\xbc", which matches the packed variable.
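The four combinations described above can be checked side by side. This is only a minimal sketch: the variable names are made up for illustration, and the ü literal is written with escapes ("\x{FC}" for what Perl sees under use utf8, "\xC3\xBC" for what it sees without) so that the result does not depend on how the example file itself is saved.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 );

# The packed value is raw bytes; decoding turns it into characters.
my $bytes = pack('H*', '66c3bc72');   # 4 bytes:      66 C3 BC 72
my $chars = decode_utf8($bytes);      # 3 characters: 66 FC 72

# Stand-ins for the "ü" literal as Perl sees it in each situation
# (hypothetical names, escapes used instead of a literal ü).
my $with_pragma    = "\x{FC}";        # under use utf8: one character
my $without_pragma = "\xC3\xBC";      # without use utf8: two bytes

my @cases = (
    [ 'bytes vs. use utf8 literal', $bytes, $with_pragma    ],
    [ 'bytes vs. raw literal',      $bytes, $without_pragma ],
    [ 'chars vs. use utf8 literal', $chars, $with_pragma    ],
    [ 'chars vs. raw literal',      $chars, $without_pragma ],
);

for my $case (@cases) {
    my ($label, $string, $needle) = @$case;
    my $result = $string =~ /\Q$needle/ ? 'match' : 'no match';
    printf "%-28s %s\n", $label, $result;
}
```

Only the like-with-like pairs match: bytes against the raw literal, and decoded characters against the use utf8 literal.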
Let's move the ü out of the s/// and into its own variable so we can inspect it.
use utf8;                              # Script is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';   # Terminal expects UTF-8.
use strict;
use warnings;
my $uuml = "ü";
printf("%d %vX %s\n", length($uuml), $uuml, $uuml); # 1 FC ü
my $fuer = pack('H*', '66c3bc72');
# $fuer holds undecoded bytes, so the output layer encodes C3 and BC separately.
printf("%d %vX %s\n", length($fuer), $fuer, $fuer); # 4 66.C3.BC.72 fÃ¼r
$fuer =~ s/\Q$uuml/!!!/;
printf("%d %vX %s\n", length($fuer), $fuer, $fuer); # 4 66.C3.BC.72 fÃ¼r
As this makes obvious, you are comparing the Unicode code point of ü (FC) against the UTF-8 encoding of ü (C3 BC).
So yes, use utf8; indicates that the script is encoded using UTF-8, but it does so in order that Perl can correctly decode the script.
Decode all inputs and encode all outputs! The solution is to replace
my $fuer = pack('H*', '66c3bc72');
with
use Encode qw( decode_utf8 );
my $fuer = decode_utf8(pack('H*', '66c3bc72'));
or
my $fuer = pack('H*', '66c3bc72');
utf8::decode($fuer);
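To illustrate the "decode all inputs, encode all outputs" rule end to end, here is a minimal sketch that assumes the packed bytes stand in for data read from outside the program. The variable names are illustrative, and the ü in the pattern is written as \x{FC} so the example does not depend on the encoding of its own file.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 encode_utf8 );

# Pretend these UTF-8 bytes arrived from a file or socket.
my $input = pack('H*', '66c3bc72');      # bytes 66 C3 BC 72

# Decode on the way in: from here on we work with characters.
my $text = decode_utf8($input);          # characters 66 FC 72

# The substitution now compares character against character.
(my $changed = $text) =~ s/\x{FC}/!!!/;

# Encode on the way out: turn characters back into UTF-8 bytes.
my $output = encode_utf8($changed);

print "$output\n";                       # f!!!r
```

Encoding explicitly is one option; an output layer such as use open ':std', ':encoding(UTF-8)' does the same job for the standard handles, in which case the decoded string is printed directly.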