
Why is umlaut not recognized in a UTF-8-encoded Perl script with "use utf8"?

The following script is encoded in UTF-8:

use utf8;

$fuer = pack('H*', '66c3bc72');

$fuer =~ s/ü/!!!/;

print $fuer;

The ü in the s/// is stored in the script as c3 bc, as the following xxd hex dump shows.

0000000: 75 73 65 20 75 74 66 38 3b 0a 0a 24 66 75 65 72  use utf8;..$fuer
0000010: 20 3d 20 70 61 63 6b 28 27 48 2a 27 2c 20 27 36   = pack('H*', '6
0000020: 36 63 33 62 63 37 32 27 29 3b 0a 0a 24 66 75 65  6c3bc72');..$fue
0000030: 72 20 3d 7e 20 73 2f c3 bc 2f 21 21 21 2f 3b 0a  r =~ s/../!!!/;.
0000040: 0a 70 72 69 6e 74 20 24 66 75 65 72 3b 0a        .print $fuer;.

c3 bc is the UTF-8 representation for ü.

Since the script is encoded in UTF-8 and I use utf8, I expected the script to replace the ü in the variable $fuer, yet it doesn't.

It does, however, if I remove the use utf8. This runs against what I thought use utf8 was for: to indicate that the script is encoded in UTF-8.

René Nyffenegger asked Feb 11 '17 11:02



2 Answers

The problem is with character boundaries: you are comparing an encoded string of bytes with a decoded character string.

$fuer = pack('H*', '66c3bc72') creates the four-byte string "\x66\xc3\xbc\x72", whereas a small u with diaeresis (ü) is the single character "\xfc", so the two don't match.

If you used decode_utf8 from the Encode module to process your variable $fuer first, it would decode the UTF-8 into the three-character string "\x66\xfc\x72", and the substitution would then work.

use utf8 applies the equivalent of decode_utf8 to the whole source file, so without it your ü stays encoded as "\xc3\xbc", which matches the packed variable.
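
The three cases above can be checked side by side. This is only a minimal sketch: the ü is written with escapes so that the result does not depend on how the file is saved, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 );

# What the literal ü in the pattern turns into (hypothetical names):
my $with_use_utf8    = "\x{FC}";      # one character, U+00FC
my $without_use_utf8 = "\xC3\xBC";    # two bytes, the UTF-8 encoding

my $bytes = pack('H*', '66c3bc72');   # four undecoded bytes
my $chars = decode_utf8($bytes);      # three characters: f, U+00FC, r

my $bytes_vs_char  = $bytes =~ /\Q$with_use_utf8/    ? 1 : 0;
my $bytes_vs_bytes = $bytes =~ /\Q$without_use_utf8/ ? 1 : 0;
my $chars_vs_char  = $chars =~ /\Q$with_use_utf8/    ? 1 : 0;

print "bytes vs. decoded ü: $bytes_vs_char\n";    # 0, the case in the question
print "bytes vs. encoded ü: $bytes_vs_bytes\n";   # 1, what happens without use utf8
print "chars vs. decoded ü: $chars_vs_char\n";    # 1, what decode_utf8 gives you
```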

Borodin answered Oct 19 '22 08:10


Let's move the ü out of the s/// and into its own variable so we can inspect it.

use utf8;                             # Script is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal expects UTF-8.

use strict;
use warnings;

my $uuml = "ü";
printf("%d %vX %s\n", length($uuml), $uuml, $uuml);   # 1 FC ü

my $fuer = pack('H*', '66c3bc72');
printf("%d %vX %s\n", length($fuer), $fuer, $fuer);   # 4 66.C3.BC.72 fÃ¼r (undecoded bytes)

$fuer =~ s/\Q$uuml/!!!/;
printf("%d %vX %s\n", length($fuer), $fuer, $fuer);   # 4 66.C3.BC.72 fÃ¼r (unchanged)

As this makes obvious, you are comparing the Unicode Code Point of ü (FC) against the UTF-8 encoding of ü (C3 BC).

So yes, use utf8; indicates that the script is encoded using UTF-8, but it does so in order that Perl can correctly decode the script. It does not decode any data the script handles.
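
To see the code point and its encoding as two different strings, here is a small sketch using the built-in utf8::encode and utf8::decode, which both work in place. The variable names are only for illustration, and the ü is written as an escape so the file encoding does not matter.

```perl
use strict;
use warnings;

my $char = "\x{FC}";              # the character ü as a single code point

my $encoded = $char;
utf8::encode($encoded);           # in place: now the two bytes C3 BC

my $decoded = $encoded;
utf8::decode($decoded);           # in place: back to the single character

printf "%d %vX\n", length($char),    $char;      # 1 FC
printf "%d %vX\n", length($encoded), $encoded;   # 2 C3.BC
printf "%d %vX\n", length($decoded), $decoded;   # 1 FC
```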

Decode all inputs and encode all outputs! The solution is to replace

my $fuer = pack('H*', '66c3bc72');

with

use Encode qw( decode_utf8 );

my $fuer = decode_utf8(pack('H*', '66c3bc72'));

or

my $fuer = pack('H*', '66c3bc72');
utf8::decode($fuer);
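
Put together with the original script, the whole thing might look like the sketch below. It assumes the file is saved as UTF-8 and that the terminal expects UTF-8; the binmode call is one way to cover the "encode all outputs" half.

```perl
use utf8;                                   # source literals are UTF-8
use strict;
use warnings;
use Encode qw( decode_utf8 );

binmode(STDOUT, ':encoding(UTF-8)');        # encode output on the way out

my $fuer = decode_utf8(pack('H*', '66c3bc72'));   # decode input: f, ü, r
$fuer =~ s/ü/!!!/;                          # character ü now meets character ü

print "$fuer\n";                            # f!!!r
```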

ikegami answered Oct 19 '22 09:10