A specially constructed string is printed differently when I use
print $b;
or
print for split //, $b;
A minimal example is:
#!perl
use warnings;
use strict;
use Encode;
my $b = decode 'utf8', "\x{C3}\x{A1}\x{E2}\x{80}\x{93}\x{C3}\x{A1}"; # 'á–á' in Unicode;
print $b, "\n";
print for split //, $b
The output on the console screen (I think I use cp860) is:
Wide character in print at xx.pl line 9.
├íÔÇô├í
Wide character in print at xx.pl line 10.
ßÔÇôß
or in hex:
C3 A1 E2 80 93 C3 A1
E1 E2 80 93 E1
(separated by 0D 0A, of course, i.e., \r\n).
The question is WHY is the character rendered differently?
Surprisingly, the effect disappears without the en-dash. It also shows up for longer strings, as the following example shows.
For the string 'Él es mi tío Toño –Antonio Pérez' (typed as Unicode in the program; note that the two lines are different!):
Wide character in print at xx.pl line 14.
├ël es mi t├¡o To├▒o ÔÇôAntonio P├®rez
Wide character in print at xx.pl line 15.
╔l es mi tÝo To±o ÔÇôAntonio PÚrez
However, for the string 'Él es mi tío Toño, Antonio Pérez':
╔l es mi tÝo To±o, Antonio PÚrez
╔l es mi tÝo To±o, Antonio PÚrez
nothing bad happens, and the two lines are rendered in the same way. The only difference is the presence of an en-dash –, i.e., '\x{E2}\x{80}\x{93}'!
Also, print join '', split //, $b; gives the same result as print $b; but a different one from print for split //, $b;.
If I add binmode STDOUT, 'utf8'; then both outputs are ÔÇô├í = E2 80 93 C3 A1.
So my question is not exactly about how to avoid it, but about why this happens: why does the same string behave differently when split?
Apparently in both cases the utf8 flag is on. Here is a more detailed program that shows more information about both strings, $a before decode and $b after decode:
#!perl
use warnings;
use strict;
use 5.010;
use Encode;
my $a = "\x{C3}\x{A1}\x{E2}\x{80}\x{93}\x{C3}\x{A1}"; # 'á–á' in Unicode;
my $b = decode 'utf8', $a;
say '------- length and utf8 ---------';
say "Length (a)=", length $a, ", is_uft8(a)=", (Encode::is_utf8 ($a) // 'no'), ".";
say "Length (b)=", length $b, ", is_uft8(b)=", (Encode::is_utf8 ($b) // 'no'), ".";
say '------- as a variable---------';
say "a: $a";
say "b: $b", ' <== *** WHY?! ***';
say '------- split ---------';
print "a: "; print for split //, $a; say '';
print "b: "; print for split //, $b; say ' <== *** DIFFERENT! ***';
say '------- split with spaces ---------';
print "a: "; print "[$_] " for split //, $a; say '';
print "b: "; print "[$_] " for split //, $b; say '';
say '------- split with properties ---------';
print "a: "; print "[$_ is_utf=" . Encode::is_utf8 ($_) . " length=" . length ($_) . "] " for split //, $a; say '';
print "b: "; print "[$_ is_utf=" . Encode::is_utf8 ($_) . " length=" . length ($_) . "] " for split //, $b; say '';
say '------- ord() ---------';
print "a: "; print ord, " " for split //, $a; say '';
print "b: "; print ord, " " for split //, $b; say '';
and here is its output on the console:
------- length and utf8 ---------
Length (a)=7, is_uft8(a)=.
Length (b)=3, is_uft8(b)=1.
------- as a variable---------
a: ├íÔÇô├í
Wide character in say at x.pl line 16.
b: ├íÔÇô├í <== *** WHY?! ***
------- split ---------
a: ├íÔÇô├í
Wide character in print at x.pl line 19.
b: ßÔÇôß <== *** DIFFERENT! ***
------- split with spaces ---------
a: [├] [í] [Ô] [Ç] [ô] [├] [í]
Wide character in print at x.pl line 22.
b: [ß] [ÔÇô] [ß]
------- split with properties ---------
a: [├ is_utf= length=1] [í is_utf= length=1] [Ô is_utf= length=1] [Ç is_utf= length=1] [ô is_utf= length=1] [├ is_utf= length=1] [í is_utf= length=1]
Wide character in print at x.pl line 25.
b: [ß is_utf=1 length=1] [ÔÇô is_utf=1 length=1] [ß is_utf=1 length=1]
------- ord() ---------
a: 195 161 226 128 147 195 161
b: 225 8211 225
The difference is whether the string being printed contains any characters >255. print only knows you did something wrong in that situation [1].
Given a handle with no :encoding layer, print expects a string of bytes (a string of characters ≤255).
When it doesn't receive bytes (the string contains characters >255), it notifies you of the error ("wide character") and guesses that you meant to encode the string using UTF-8.
You can think of print on a handle with no :encoding layer as doing the following:
if ($s =~ /[^\x00-\xFF]/) {
    warn("Wide character");
    utf8::encode($s);
}
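To make that model concrete, here is a minimal runnable sketch of it. The helper names bytes_print_would_emit and hex_dump are made up for this illustration; they are not part of Perl or of any module, and the function only imitates the behaviour described above rather than calling print itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bytes that print would write to a handle
# with no :encoding layer, following the model described above.
sub bytes_print_would_emit {
    my ($s) = @_;                      # work on a copy of the argument
    if ($s =~ /[^\x00-\xFF]/) {        # any character above 255?
        warn "Wide character\n";
        utf8::encode($s);              # the whole string gets UTF-8 encoded
    }
    return $s;
}

# Hypothetical helper: format a string of bytes as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $str = "\xE1\x{2013}\xE1";          # the decoded characters of 'á–á'

# Whole string: the one character above 255 causes everything to be encoded.
print hex_dump(bytes_print_would_emit($str)), "\n";

# Character by character: only the en-dash triggers the fallback.
print hex_dump(join '', map { bytes_print_would_emit($_) } split //, $str), "\n";
```

Under this model the first line comes out as C3 A1 E2 80 93 C3 A1 and the second as E1 E2 80 93 E1, which matches the hex dump in the question.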
my $b = decode 'utf8', "\x{C3}\x{A1}\x{E2}\x{80}\x{93}\x{C3}\x{A1}";
is the same as
my $b = "\xE1\x{2013}\xE1";
As such, you are doing
print "\xE1\x{2013}\xE1";
in the first case (print $b;), and
print "\xE1";
print "\x{2013}";
print "\xE1";
in the second case (print for split //, $b;).
print "\xE1\x{2013}\xE1"; # Wide char! C3 A1 E2 80 93 C3 A1
Perl notices you forgot to encode, warns you, and prints the string encoded using UTF-8.
print "\xE1"; # E1
Perl has no way of knowing you forgot to encode, so it prints what you asked it to print.
print "\x{2013}"; # Wide char! E2 80 93
Perl notices you forgot to encode, warns you, and prints the string encoded using UTF-8.
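If you want to check this without depending on the console code page, one option is to print to an in-memory file handle and look at the bytes that arrive there. The sketch below assumes that an in-memory handle behaves like STDOUT in this respect; printed_bytes_hex is a name invented for this example. It also repeats the experiment with an :encoding(UTF-8) layer, where both ways of printing should agree.

```perl
use strict;
use warnings;

# Hypothetical helper: print each piece separately to an in-memory handle
# opened with the given layer ('' for none) and return the resulting bytes
# as space-separated hex.
sub printed_bytes_hex {
    my ($layer, @pieces) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    {
        no warnings 'utf8';            # silence "Wide character" in this demo
        print {$fh} $_ for @pieces;
    }
    close $fh;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $str = "\xE1\x{2013}\xE1";          # the decoded characters of 'á–á'

# No encoding layer: the whole string and the split string differ.
print "no layer, whole:    ", printed_bytes_hex('', $str), "\n";
print "no layer, split:    ", printed_bytes_hex('', split //, $str), "\n";

# With an encoding layer every character is encoded, so both agree.
print "UTF-8 layer, whole: ", printed_bytes_hex(':encoding(UTF-8)', $str), "\n";
print "UTF-8 layer, split: ", printed_bytes_hex(':encoding(UTF-8)', split //, $str), "\n";
```

The design point is that the encoding decision moves from print's per-call guess to the handle itself, so it no longer matters how the string is chopped up before printing.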
Footnotes
[1] The choice of storage format (as returned by is_utf8) should never have an effect. print is correctly unaffected by it.
utf8::downgrade( my $d = chr(0xE1) ); print($d); # UTF8=0 prints E1
utf8::upgrade( my $u = chr(0xE1) ); print($u); # UTF8=1 prints E1
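The two lines above can be turned into a self-checking sketch by capturing what print writes. As before, this assumes an in-memory handle stands in for STDOUT, and printed_hex is a hypothetical helper written only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: print $s to an in-memory handle with no encoding
# layer and return the bytes written, as space-separated hex.
sub printed_hex {
    my ($s) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $s;
    close $fh;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

utf8::downgrade( my $d = chr(0xE1) );  # stored as a single byte
utf8::upgrade(   my $u = chr(0xE1) );  # stored internally as UTF-8

printf "UTF8=%d prints %s\n", (utf8::is_utf8($d) ? 1 : 0), printed_hex($d);
printf "UTF8=%d prints %s\n", (utf8::is_utf8($u) ? 1 : 0), printed_hex($u);
```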