 

Is it correct to switch Perl's default IO to UTF-8 while using Plack and middlewares?

Tags: utf-8, perl, plack

Two starting points:

  • In his answer to Why does modern Perl avoid UTF-8 by default? tchrist pointed out 52 things needed to ensure correct Unicode handling in Perl. The answer shows the boilerplate code with some use statements. A similar question about the use of Unicode is How to make "use My::defaults" with modern perl & utf8 defaults?
  • The PSGI spec is byte oriented by design. It is my responsibility to encode/decode everything, so for Plack apps the correct way is to encode output and decode input, e.g.:

    use Encode;
    my $app = sub {
        # myapp() returns a character string; PSGI expects bytes in the body
        my $output = encode_utf8( myapp() );
        return [ 200, [ 'Content-Type' => 'text/plain' ], [ $output ] ];
    };
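
To make the "decode input, encode output" rule concrete, here is a minimal sketch that does both at the PSGI boundary. The `greet` function and the use of `QUERY_STRING` as the input are assumptions chosen for illustration; a real app would also URI-unescape the query string, and nothing here depends on Plack itself, since a PSGI app is just a code reference.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical application logic: it only ever sees character strings.
sub greet {
    my ($name) = @_;
    return "Hello, $name \x{263A}";
}

my $app = sub {
    my ($env) = @_;

    # Input boundary: PSGI hands over bytes, so decode them explicitly.
    my $name = decode('UTF-8', $env->{QUERY_STRING} // '');

    # Output boundary: encode characters back to bytes before returning.
    my $body = encode('UTF-8', greet($name));

    return [
        200,
        [
            'Content-Type'   => 'text/plain; charset=utf-8',
            'Content-Length' => length($body),
        ],
        [ $body ],
    ];
};

# Call the app directly with a hand-built environment (UTF-8 bytes for "Jürgen").
my $res = $app->({ QUERY_STRING => "J\xC3\xBCrgen" });
```

Because the encoding happens inside the app, no IO layer on STDIN or STDOUT is involved, and the body stays a plain byte string whose length matches `Content-Length`.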
    

Is it correct to use

use uni::perl; # or any similar

in the PSGI application and/or in my modules?

uni::perl changes Perl's default IO to UTF-8, thus:

use open qw(:std :utf8);
binmode(STDIN,   ":utf8");
binmode(STDOUT,  ":utf8");
binmode(STDERR,  ":utf8");

Will doing so break something in Plack or its middlewares? Or is the only correct way to write apps for Plack to encode and decode explicitly, setting the layers at each open, and so without the open pragma?

cajwine asked Jun 13 '12 09:06





1 Answer

You really don't want to put STDIN/STDOUT into UTF-8 mode by default under Plack, because you cannot know whether those filehandles are being used as binary data transports. For example, if they are the FastCGI protocol connector, they will be carrying encoded binary structures and not UTF-8 text. They therefore must not have an encoding layer defined, or those binary structures will be mangled or rejected as invalid.
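
The mangling is easy to reproduce without FastCGI. The following is a minimal sketch that writes the same four-byte payload (a made-up value, standing in for a binary protocol record) through a raw handle and through a handle with a `:utf8` layer, using in-memory filehandles in place of STDOUT:

```perl
use strict;
use warnings;

# Hypothetical binary payload; two of its bytes are above 0x7F.
my $payload = "\x01\xFF\x00\x80";

# Raw handle: the bytes pass through unchanged.
my $raw = '';
open my $raw_fh, '>:raw', \$raw or die "open failed: $!";
print {$raw_fh} $payload;
close $raw_fh;

# Handle with a UTF-8 layer: each byte above 0x7F is treated as a
# character and written out as a two-byte UTF-8 sequence.
my $layered = '';
open my $utf8_fh, '>:utf8', \$layered or die "open failed: $!";
print {$utf8_fh} $payload;
close $utf8_fh;

printf "raw:     %d bytes (%s)\n", length($raw),     unpack('H*', $raw);
printf "layered: %d bytes (%s)\n", length($layered), unpack('H*', $layered);
```

The raw buffer holds the original four bytes, while the layered one grows to six (`01 c3 bf 00 c2 80`), which a binary protocol peer would no longer parse as the record that was sent.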

LeoNerd answered Sep 23 '22 14:09
