Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl's use encoding pragma breaking UTF strings

Tags:

perl

I have a problem with Perl and Encoding pragma.

(I use utf-8 everywhere, in input, output, the perl scripts themselves. I don't want to use other encoding, never ever.)

However. When I write

binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;

I see the string "í" (which is what I want - and what is U+00ED unicode char). But when I add the "use encoding" pragma like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;

all I see is a box character. Why?

Moreover, when I add Data::Dumper and let the Dumper print the new string like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);

I see that perl changed the string to "\x{fffd}". Why?

like image 398
Karel Bílek Avatar asked Mar 19 '11 15:03

Karel Bílek


2 Answers

use encoding 'utf8' is broken. Rather than interpreting \x{ed} as the code point U+00ED, it interprets it as the single byte 237 and then tries to interpret that as UTF-8. Which of course fails, so it winds up replacing it with the replacement character U+FFFD, literally "�".

Just stick with use utf8 to specify that your source is in UTF-8, and binmode or the open pragma to specify the encoding for your file handles.

like image 134
Anomie Avatar answered Oct 02 '22 19:10

Anomie


Your actual code needs neither use encoding nor use utf8 to run properly -- the only thing it depends on is the encoding layer on STDOUT.

binmode(STDOUT, ":utf8");
print "\xed";

is an equally valid complete program that does what you want.

use utf8 should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written

my $r = "í";

then use utf8 would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.

use encoding should never be used, especially by someone who likes Unicode. If you want the encoding of stdin/out to be changed you should use -C or PERLUNICODE or binmode them yourself, and if you want other handles to be automatically openhed with encoding layers you should useopen.

like image 30
hobbs Avatar answered Oct 02 '22 17:10

hobbs