
Why does PerlIO::encoding insert an additional utf8 layer?

Tags: encoding, perl

The documentation for PerlIO says:

:encoding Use :encoding(ENCODING) either in open() or binmode() to install a layer that transparently does character set and encoding transformations, for example from Shift-JIS to Unicode. Note that under stdio an :encoding also enables :utf8. See PerlIO::encoding for more information.

Here is a test script:

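use feature qw(say);
use strict;
use warnings;

my $fn = 'test.txt';
for my $mode ('>', '>:encoding(utf8)') {
    # Open the file for writing with the given mode and layers.
    open(my $fh, $mode, $fn) or die "Cannot open $fn: $!";
    # Print the names of the layers currently on the handle.
    say join ' ', PerlIO::get_layers($fh);
    close $fh;
}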

Output is:

unix perlio
unix perlio encoding(utf8) utf8

Why do I get the additional utf8 layer here?

asked Jul 09 '15 19:07 by Håkon Hægland


1 Answer

For reasons that require knowledge of Perl internals.


When you store the number 4 in a scalar, it could be stored as a signed integer, an unsigned integer or a floating point number. You don't know which is used, and you don't have any reason to care which one is used. Perl will automatically convert as needed.
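As a rough illustration of the numeric case, the sketch below peeks at a scalar's internal flags with the core B module. The helper name storage_kind is made up for this example, and which slot Perl fills can depend on how the value was produced, so treat the labels as an approximation.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report which numeric slot of the scalar is marked valid.
sub storage_kind {
    my ($value) = @_;    # copying the argument keeps its integer/float flags
    my $flags = B::svref_2object(\$value)->FLAGS;
    return 'integer' if $flags & B::SVf_IOK();
    return 'float'   if $flags & B::SVf_NOK();
    return 'other';
}

print storage_kind(4),   "\n";    # literal written without a decimal point
print storage_kind(4.0), "\n";    # literal written with a decimal point
print 4 == 4.0 ? "equal\n" : "different\n";
```

The two literals compare equal at the Perl level even though they are stored differently, which is the same kind of transparency described for strings.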

It's the same situation for strings. There are two storage formats for them. Your name is the perfect example. "Håkon Hægland" can be stored as

48.E5.6B.6F.6E.20.48.E6.67.6C.61.6E.64

or as

48.C3.A5.6B.6F.6E.20.48.C3.A6.67.6C.61.6E.64

A flag called UTF8 indicates the choice of storage format. This is transparent to the user (or at least should be).

$ perl -Mutf8 -E'
    $_ = "Håkon Hægland";
    utf8::downgrade( $d = $_ );  # Converts to the first format mentioned above.
    utf8::upgrade(   $u = $_ );  # Converts to the second format mentioned above.
    say $d eq $u ? "eq" : "ne";
'
eq
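To see the two storage formats directly, the following sketch dumps the internal string buffer of each copy. The helper internal_bytes is a name invented for this example; it relies on Encode::_utf8_off, which Encode documents as an internals-level function, so this is for inspection only and not something to use in ordinary code.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the bytes of the scalar's internal buffer as hex.
sub internal_bytes {
    my ($copy) = @_;              # the copy keeps the same storage format
    Encode::_utf8_off($copy);     # expose the raw buffer as a plain byte string
    my $hex = uc unpack 'H*', $copy;
    return join '.', unpack '(A2)*', $hex;
}

# Escapes are used so the example does not depend on the source file's encoding.
my $name = "H\x{E5}kon H\x{E6}gland";
utf8::downgrade( my $d = $name );    # first format
utf8::upgrade(   my $u = $name );    # second format

for my $s ($d, $u) {
    printf "UTF8 flag %s: %s\n", ( utf8::is_utf8($s) ? 'on ' : 'off' ), internal_bytes($s);
}
```

Both scalars still compare equal with eq; only the flag and the buffer contents differ.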

While it's transparent to you, it's far from transparent to Perl itself. Whenever you manipulate a string, Perl has to check in which storage format it's stored. For example, if you concatenate two strings, Perl has to make sure they use the same storage format before performing the concatenation, converting one if necessary.
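A small sketch of the concatenation case, assuming one operand is stored in the first format and the other contains a character that only fits in the second. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $latin = "H\x{E5}kon";
utf8::downgrade($latin);         # force the first (one byte per character) format
my $wide = " \x{263A}";          # U+263A cannot be stored in the first format

# Perl reconciles the storage formats before joining the two buffers.
my $joined = $latin . $wide;

printf "left flag:   %d\n", utf8::is_utf8($latin)  ? 1 : 0;
printf "right flag:  %d\n", utf8::is_utf8($wide)   ? 1 : 0;
printf "joined flag: %d\n", utf8::is_utf8($joined) ? 1 : 0;
printf "joined length in characters: %d\n", length($joined);
printf "second character: U+%04X\n", ord( substr( $joined, 1, 1 ) );
```

The characters of the left operand survive unchanged in the result even though their byte representation had to be converted.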

It's also not transparent to PerlIO. PerlIO, like the rest of Perl, has to deal with the bytes in the string buffer rather than what you see at the Perl level. Sometimes, those bytes are destined to be the string buffer of scalars with the UTF8 flag cleared, and sometimes, those bytes are destined to be the string buffer of scalars with the UTF8 flag set. PerlIO needs to track that. Rather than carrying a flag along from layer to layer, PerlIO adds a :utf8 layer when the scalars obtained by reading from the handle need to have the UTF8 flag set.
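The effect on reading can be sketched with an in-memory file handle, comparing a raw read with a read through :encoding(UTF-8). The helper read_with and the sample bytes are assumptions made for this example; the exact layer names printed may differ between Perl builds.

```perl
use strict;
use warnings;

# UTF-8 encoded "file content" (15 bytes), held in memory for the example.
my $bytes = "H\xC3\xA5kon H\xC3\xA6gland";

# Hypothetical helper: read one line through the given layers and
# return it together with the layer names reported for the handle.
sub read_with {
    my ($layers) = @_;
    open my $fh, "<$layers", \$bytes or die "open failed: $!";
    my @stack = PerlIO::get_layers($fh);
    my $line  = <$fh>;
    close $fh;
    return ( $line, @stack );
}

for my $layers ( ':raw', ':encoding(UTF-8)' ) {
    my ( $line, @stack ) = read_with($layers);
    printf "%-17s layers=[%s] length=%d UTF8 flag=%d\n",
        $layers, join( ' ', @stack ), length($line), utf8::is_utf8($line) ? 1 : 0;
}
```

With :raw the scalar holds 15 bytes and the flag is off; with :encoding(UTF-8) it holds 13 characters, the flag is on, and utf8 shows up in the reported layers.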


So, :encoding converts the bytes that form

Håkon Hægland

from the specified encoding to

48.C3.A5.6B.6F.6E.20.48.C3.A6.67.6C.61.6E.64

And :utf8 sets the UTF8 flag on the scalar, so the resulting scalar contains

U+0048.00E5.006B.006F.006E.0020.0048.00E6.0067.006C.0061.006E.0064
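The two steps can be imitated outside PerlIO with the Encode module. This is only an analogy under stated assumptions: the source encoding is taken to be ISO-8859-1, the helper names transcode and mark_as_characters are invented here, and Encode::_utf8_on is an internals-level function that should not be used on untrusted data.

```perl
use strict;
use warnings;
use Encode ();

# Step 1, roughly what :encoding does: convert the bytes from the
# source encoding to UTF-8. The result is still a string of bytes.
sub transcode {
    my ( $encoding, $octets ) = @_;
    return Encode::encode( 'UTF-8', Encode::decode( $encoding, $octets ) );
}

# Step 2, roughly what :utf8 does: set the UTF8 flag so the same
# buffer is interpreted as characters.
sub mark_as_characters {
    my ($buffer) = @_;
    Encode::_utf8_on($buffer);
    return $buffer;
}

my $on_disk = "H\xE5kon H\xE6gland";    # the name as ISO-8859-1 bytes
my $buffer  = transcode( 'ISO-8859-1', $on_disk );
my $string  = mark_as_characters($buffer);

printf "after step 1: length %d, UTF8 flag %d\n", length($buffer), utf8::is_utf8($buffer) ? 1 : 0;
printf "after step 2: length %d, UTF8 flag %d\n", length($string), utf8::is_utf8($string) ? 1 : 0;
printf "second character: U+%04X\n", ord( substr( $string, 1, 1 ) );
```

After the first step the length counts 15 bytes; after the second it counts 13 characters, with U+00E5 in the second position.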
answered Oct 15 '22 11:10 by ikegami