
How does the open :utf8 pragma work in Perl in relation to CPAN modules?

Tags:

perl

I have already read through the following:

  • Perl Unicode Cookbook
  • How differs the open pragma with different utf8?
  • How to use utf8 encode with open pragma
  • Does the autodie-pragma have influence on the encoding?
  • many others... (like: perlunicode, perluniintro, perlunifaq, perlunitut)... ;(
  • and of course tchrist's genius answer to "Why does modern Perl avoid UTF-8 by default?"

but I have probably missed some BASIC points.

Using the

use open qw(:utf8);

Does it affect CPAN modules too? E.g. when some CPAN module opens a file, will it be opened with :utf8? Is this statement TRUE? (Or is the open pragma only lexically scoped?) AFAIK it affects modules too, but in an "inconsistent" way (probably a problem with the modules).

Does the open pragma have any effect on opendir? From what I have tried, no: I still need an extra decode on all filenames coming from readdir (in addition to NFC). So is IO::Dir something different that the open pragma doesn't cover?

Does the open pragma affect sockets and pipes too? (E.g. anything that is a sort of IO::Handle?)

Do all (or most) CPAN modules know how they need to do their I/O (utf8, latin1 or raw)? (Probably not, because even a simple autodie doesn't work with the open pragma... :()

In many places I can read a similar rule: Remember the canonical rule of Unicode: always encode/decode at the edges of your application. This is a nice rule, but the application edge means my own source code. CPAN modules are (usually) beyond the edge too, not only the "outer world" like the system or the network...

From my experience, 3/4 of the content of my short scripts (which rely heavily on CPAN) consists of top declarations and dozens of encode/decode/NFC calls for nearly everything...

E.g. even logging utilities need explicit encoding:

use utf8;                  # the source contains a non-ASCII literal
use Encode qw(encode);
use Log::Any qw($log);
use Log::Any::Adapter ('File', 'file.log');
$log->error( encode('utf-8', "tökös"));

Even when I want to add tie to my code, I need to replace every $key and $value with encoded versions.

Is this true, or have I missed some really basic point in all the above docs?

Some CPAN modules handle utf8 internally, like JSON::XS, YAML::XS, File::Slurp (although I never succeeded in getting correct results from YAML::XS; pure YAML and JSON::XS work without any problems).

For some modules "hacks" exist, like DBIx::Class::ForceUTF8, Template::Stash::ForceUTF8, HTML::FillInForm::ForceUTF8 and so on, which don't allow writing a correct application for "both" the utf and non-utf worlds... ;(

Many CPAN modules don't call the above 'hacked variants' internally (e.g. HTML::FillInForm::ForceUTF8) but only the plain one, so it is impossible to use them correctly with utf8... Others silently fail.. ;(

A Plack application doesn't handle utf8 logging messages without the annoying "Wide character...." warning ;( /modern perl :(/ and I could go on ;(

From the above I "deduced" (probably wrongly) that I MUST know and remember, for every CPAN module, how it handles utf8 encoded strings, and because there is no "registry" anywhere, it is mostly trial and error.

So the main question is:

While I remember that there is no magic bullet, is there some good way to detect and know "utf8 ready CPAN modules" that don't need special encode/decode before using them?

In case someone needs to know, I'm using the following in every script:

use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:utf8); # this is sometimes bad, so using only: open qw(:std :utf8)
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);

Hm.. I just "discovered" the utf8::all Perl module, which replaces readdir with a version that does the decode.

asked Jul 01 '13 15:07 by jm666


1 Answer

Emphasis mine:

The open pragma serves as one of the interfaces to declare default "layers" (also known as "disciplines") for all I/O. Any two-argument open, readpipe (aka qx//) and similar operators found within the lexical scope of this pragma will use the declared defaults. Even three-argument opens may be affected by this pragma when they don't specify IO layers in MODE.

So no, it doesn't affect any code in which the pragma isn't present. A handle opened within the scope of such a pragma won't lose its layers if passed to code outside of the scope of the pragma, though.
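To see the lexical scoping for yourself, here is a minimal sketch. The sub open_outside_pragma is a hypothetical stand-in for a CPAN module that does its own open; it is not taken from any real module. The sketch assumes nothing else (such as the PERL_UNICODE environment variable or the -C switch) is adding default layers.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file to open from two different lexical scopes.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Hypothetical stand-in for a CPAN module: no "use open" in this scope.
sub open_outside_pragma {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "open $file: $!";
    return $fh;
}

my $fh_inside;
{
    use open qw(:utf8);    # lexically scoped: applies to this block only
    open($fh_inside, '<', $path) or die "open $path: $!";
}

my $fh_outside = open_outside_pragma($path);

# PerlIO::get_layers lists the layers attached to a handle. It is called
# here, outside the pragma's scope, on both handles.
my %inside  = map { $_ => 1 } PerlIO::get_layers($fh_inside);
my %outside = map { $_ => 1 } PerlIO::get_layers($fh_outside);

print "opened inside scope:  ", ($inside{utf8}  ? "utf8 layer" : "no utf8 layer"), "\n";
print "opened outside scope: ", ($outside{utf8} ? "utf8 layer" : "no utf8 layer"), "\n";
```

The handle opened inside the block should still report its utf8 layer when inspected from outside the block, while the handle opened by the "module" should not have one.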


Tests to see what a module expects:

Input

  • Test 1
    • Have the input source return a string containing a code point in 80..FF and no code point above FF.
  • Test 2
    • Have the input source return a string containing a code point above FF.

Output

  • Test 1
    • Output a string containing a code point in 80..FF and no code point above FF. Pass the string through utf8::downgrade($_); first.
  • Test 2
    • Same as Test 1, but pass the string through utf8::upgrade($_); first.
  • Test 3
    • Output a string containing a code point above FF
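
The three output tests can be sketched roughly like this. The sub bytes_written_by_module is a hypothetical stand-in for "hand the string to the module and capture what it wrote"; here it just prints to a plain handle with no encoding layer, so replace its body with a call into the module you are actually testing.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the module under test: writes the string to a
# handle with no encoding layer and returns the bytes that reached the file.
sub bytes_written_by_module {
    my ($string) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;                     # make sure no layers are in effect
    {
        no warnings 'utf8';          # silence "Wide character" in test 3
        print {$fh} $string;
    }
    close $fh or die "close $path: $!";

    open(my $in, '<:raw', $path) or die "open $path: $!";
    local $/;                        # slurp mode
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Test 1: code point in 80..FF, downgraded internal representation.
my $downgraded = "\x{E9}";
utf8::downgrade($downgraded);

# Test 2: same code point, upgraded internal representation.
my $upgraded = "\x{E9}";
utf8::upgrade($upgraded);

# Test 3: a code point above FF.
my $wide = "\x{263A}";

my %result = (
    test1 => bytes_written_by_module($downgraded),
    test2 => bytes_written_by_module($upgraded),
    test3 => bytes_written_by_module($wide),
);
printf "%s: %s\n", $_, $result{$_} for sort keys %result;
```

My reading of the results (an interpretation, not part of the tests as stated): if tests 1 and 2 give different bytes, the module depends on the string's internal representation; if they both give the UTF-8 encoding of the character, the module encodes for you; if test 3 warns or dies, the module expects already-encoded bytes.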
answered Nov 01 '22 03:11 by ikegami