Why is umlaut not recognized in a UTF-8-encoded Perl script with "use utf8"?

Tags: character-encoding, utf-8, perl

The following script is encoded in UTF-8:

use utf8;

$fuer = pack('H*', '66c3bc72');

$fuer =~ s/ü/!!!/;

print $fuer;

The ü in the s/// is stored in the script as c3 bc, as the following xxd hex dump shows.

0000000: 75 73 65 20 75 74 66 38 3b 0a 0a 24 66 75 65 72  use utf8;..$fuer
0000010: 20 3d 20 70 61 63 6b 28 27 48 2a 27 2c 20 27 36   = pack('H*', '6
0000020: 36 63 33 62 63 37 32 27 29 3b 0a 0a 24 66 75 65  6c3bc72');..$fue
0000030: 72 20 3d 7e 20 73 2f c3 bc 2f 21 21 21 2f 3b 0a  r =~ s/../!!!/;.
0000040: 0a 70 72 69 6e 74 20 24 66 75 65 72 3b 0a        .print $fuer;.

c3 bc is the UTF-8 representation for ü.

Since the script is encoded in UTF-8 and I am using use utf8, I expected the script to replace the ü of für in the variable $fuer, yet it doesn't.

It does, however, if I remove the use utf8. This runs against what I thought use utf8 was for: to indicate that the script is encoded in UTF-8.


asked Feb 11 '17 11:02

René Nyffenegger

2 Answers

The problem is with character boundaries: you are comparing an encoded string of bytes with a decoded character string.

$fuer = pack('H*', '66c3bc72') creates the four-byte string "\x66\xc3\xbc\x72", whereas a small u with diaeresis (ü) is the single character "\xfc", so the two don't match.

If you passed your variable $fuer through decode_utf8 from the Encode module, it would decode the UTF-8 bytes into the three-character string "\x66\xfc\x72", and the substitution would then work.

use utf8 applies the equivalent of decode_utf8 to the whole source file, so without it the ü in your source stays encoded as "\xc3\xbc", which matches the packed variable.
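As a minimal sketch of the point above (assuming the file is saved as UTF-8; the variable names $bytes, $chars and $replaced are only illustrative), decoding the packed bytes first lets the literal ü in the pattern match:

```perl
use strict;
use warnings;
use utf8;                        # source is UTF-8, so the ü below is one character (U+00FC)
use Encode qw( decode_utf8 );

my $bytes = pack('H*', '66c3bc72');    # four bytes: 66 C3 BC 72
my $chars = decode_utf8($bytes);       # three characters: 66 FC 72

(my $replaced = $chars) =~ s/ü/!!!/;   # copy, then substitute on the decoded copy

printf "%d %vX\n", length($bytes), $bytes;   # 4 66.C3.BC.72
printf "%d %vX\n", length($chars), $chars;   # 3 66.FC.72
print "$replaced\n";                         # f!!!r
```

The same substitution applied to $bytes would leave it unchanged, which is the behaviour described in the question.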


answered Oct 19 '22 08:10

Borodin

Let's move the ü out of the s/// and into its own variable so we can inspect it.

use utf8;                             # Script is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal expects UTF-8.

use strict;
use warnings;

my $uuml = "ü";
printf("%d %vX %s\n", length($uuml), $uuml, $uuml);   # 1 FC ü

my $fuer = pack('H*', '66c3bc72');
printf("%d %vX %s\n", length($fuer), $fuer, $fuer);   # 4 66.C3.BC.72 fÃ¼r

$fuer =~ s/\Q$uuml/!!!/;
printf("%d %vX %s\n", length($fuer), $fuer, $fuer);   # 4 66.C3.BC.72 fÃ¼r

As this makes obvious, you are comparing the Unicode Code Point of ü (FC) against the UTF-8 encoding of ü (C3 BC).

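So yes, use utf8; indicates that the script is encoded using UTF-8, but it does so in order that Perl can correctly decode the script. The ü in your source therefore becomes the single character FC rather than the two bytes C3 BC.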
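To see the effect of the pragma in isolation, here is a small sketch (assuming the file is saved as UTF-8; $with and $without are illustrative names) that assigns the same literal once under use utf8 and once under no utf8:

```perl
use strict;
use warnings;

my ($with, $without);
{ use utf8; $with    = "ü"; }   # literal is decoded: one character
{ no utf8;  $without = "ü"; }   # literal is kept as raw bytes: two bytes

printf "%d %vX\n", length($with),    $with;      # 1 FC
printf "%d %vX\n", length($without), $without;   # 2 C3.BC
```

The second form is what the pattern in the question contains once use utf8 is removed, which is why it then happens to match the packed bytes.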

Decode all inputs and encode all outputs! The solution is to replace

my $fuer = pack('H*', '66c3bc72');

with

use Encode qw( decode_utf8 );

my $fuer = decode_utf8(pack('H*', '66c3bc72'));

or

my $fuer = pack('H*', '66c3bc72');
utf8::decode($fuer);
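The "decode all inputs and encode all outputs" rule can be sketched end to end as follows. This is only an illustration: the packed string stands in for bytes read from a file or socket, the replacement with ö is an arbitrary choice, and the names $input, $text and $output are assumptions, not part of the original script.

```perl
use strict;
use warnings;
use utf8;                           # the ü and ö below are single characters
use Encode qw( decode encode );

my $input  = pack('H*', '66c3bc72');     # stand-in for raw UTF-8 bytes from outside
my $text   = decode('UTF-8', $input);    # decode on the way in: 66 FC 72
$text =~ s/ü/ö/;                         # work on characters: 66 F6 72
my $output = encode('UTF-8', $text);     # encode on the way out

printf "%vX\n", $output;                 # 66.C3.B6.72
```

In a real script the encoding step is usually delegated to an I/O layer, as the use open ':std', ':encoding(UTF-8)' line above does for the standard handles.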

answered Oct 19 '22 09:10

ikegami
