I wonder why most modern solutions built using Perl don't enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.
Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?
Commenting @tchrist got too long, so I'm adding it here.
It seems that I did not make myself clear. Let me try to add some things.
tchrist and I see the situation in much the same way, but our conclusions are at opposite ends. I agree that the situation with Unicode is complicated, but this is exactly why we (Perl users and coders) need some layer (or pragma) that makes UTF-8 handling as easy as it must be nowadays.
tchrist pointed out many aspects to cover, and I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is no single way "to enable UTF-8". I don't have enough knowledge to argue with that, so I'll stick to live examples.
I played around with Rakudo, and UTF-8 was just there as I needed it. I didn't have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.
Shouldn't that be a goal in modern Perl 5 too? To stress the point: I'm not suggesting UTF-8 as the default character set for core Perl; I'm suggesting a way to switch it on with a snap for those who develop new projects.
Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I could not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see there was a bounty here to deal with the same problem in Mason 2: How to make Mason2 UTF-8 clean?. So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!
I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In some ways tchrist is right and answers my question: new projects don't adopt UTF-8 because it is too complicated in Perl 5.
Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF-8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF-8. Both these are global effects, not lexical ones.
At the top of your source file (program, module, library, dohickey), prominently assert that you are running perl version 5.12 or better via:

    use v5.12;  # minimal for unicode string feature
    use v5.14;  # optimal for unicode string feature

Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

    use warnings;
    use warnings qw( FATAL utf8 );

Declare that this source unit is encoded as UTF-8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:

    use utf8;

Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF-8 unless you tell it otherwise. That way you do not affect other module’s or other program’s code.

    use open qw( :encoding(UTF-8) :std );

Enable named characters via \N{CHARNAME}.

    use charnames qw( :full :short );

If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:

    binmode(DATA, ":encoding(UTF-8)");

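To make the effect of an explicit encoding layer concrete, here is a minimal sketch. It assumes an in-memory filehandle standing in for a real file, and the variable names are invented for the illustration; it is not taken from the answer's own code.

```perl
use v5.14;
use warnings;
use warnings qw( FATAL utf8 );

# "nino" with n-tilde, spelled with an escape so this file stays pure ASCII
my $text = "ni\x{F1}o";

# Write through an explicit UTF-8 encoding layer to an in-memory "file".
my $buffer = "";
open(my $out, ">:encoding(UTF-8)", \$buffer) or die "cannot open: $!";
print {$out} $text;
close($out) or die "cannot close: $!";

# Read the same octets back through the matching decoding layer.
open(my $in, "<:encoding(UTF-8)", \$buffer) or die "cannot open: $!";
my $round_trip = <$in>;
close($in);

my $char_count = length $text;     # counts code points
my $byte_count = length $buffer;   # counts encoded octets
say "characters: $char_count, bytes: $byte_count";
```

The two counts differ because the layer turns one code point into two octets on the way out and reverses that on the way in.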
There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal to “make everything just work with UTF-8”, albeit for a somewhat weakened sense of those terms.
One other pragma, although it is not Unicode related, is:
use autodie;
It is strongly recommended.
🐪🐫🐪 Go Thou and Do Likewise 🐪🐫🐪
My own boilerplate these days tends to look like this:

    use 5.014;

    use utf8;
    use strict;
    use autodie;
    use warnings;
    use warnings    qw< FATAL  utf8     >;
    use open        qw< :std  :utf8     >;
    use charnames   qw< :full >;
    use feature     qw< unicode_strings >;

    use File::Basename      qw< basename >;
    use Carp                qw< carp croak confess cluck >;
    use Encode              qw< encode decode >;
    use Unicode::Normalize  qw< NFD NFC >;

    END { close STDOUT }

    if (grep /\P{ASCII}/ => @ARGV) {
        @ARGV = map { decode("UTF-8", $_) } @ARGV;
    }

    $0 = basename($0);  # shorter messages
    $| = 1;

    binmode(DATA, ":utf8");

    # give a full stack dump on any untrapped exceptions
    local $SIG{__DIE__} = sub {
        confess "Uncaught exception: @_" unless $^S;
    };

    # now promote run-time warnings into stack-dumped
    #   exceptions *unless* we're in a try block, in
    #   which case just cluck the stack dump instead
    local $SIG{__WARN__} = sub {
        if ($^S) { cluck   "Trapped warning: @_" }
        else     { confess "Deadly warning: @_"  }
    };

    while (<>)  {
        chomp;
        $_ = NFD($_);
        ...
    } continue {
        say NFC($_);
    }

    __END__

Saying that “Perl should [somehow!] enable Unicode by default” doesn’t even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it’s also how those characters all interact in many, many ways.
Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new Brave New World modernity.
It is way way way more complicated than people pretend. I’ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you’ll break either your own code or somebody else’s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.
🐪 goes out of its way to make Unicode easy, far more than anything else I’ve ever used. If you think this is bad, try something else for a while. Then come back to 🐪: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make 🐪 better at these things.
At a minimum, here are some things that would appear to be required for 🐪 to “enable Unicode by default”, as you put it:
All 🐪 source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPT=-Mutf8.
The 🐪 DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
Program arguments to 🐪 scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPT=-CA.
The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.
Any other handles opened by 🐪 should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPT=-CD would work. That makes -CSAD for all of them.
Cover both bases plus all the streams you open with export PERL5OPT=-Mopen=:utf8,:std. See uniquote.
You don’t want to miss UTF-8 encoding errors. Try export PERL5OPT=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.
Code points between 128–255 should be understood by 🐪 to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPT=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPT=-Mv5.12 or better will also get that.
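
The two claims about code points between 128 and 255 can be checked with a small sketch like the following. It assumes perl 5.14 or later, where use v5.14 switches the unicode_strings feature on; the variable names are only illustrative.

```perl
use v5.14;   # enables strict and the unicode_strings feature
use warnings;

my $sharp_s = "\xDF";   # U+00DF LATIN SMALL LETTER SHARP S
my $e_acute = "\xE9";   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# With unicode_strings in effect, these are treated as Unicode characters,
# not as property-less bytes.
my $upper        = uc $sharp_s;
my $is_word_char = $e_acute =~ /\w/ ? 1 : 0;

say "uc of sharp s is '$upper'";
say "e-acute matches \\w: $is_word_char";
```
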
Named Unicode characters are not by default enabled, so add export PERL5OPT=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPT=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There’s no I/O layer for these yet that I’m aware of, but see nfc, nfd, nfkd, and nfkc.
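
As a rough illustration of the NFD-in, NFC-out habit, this sketch decomposes one precomposed string and recomposes it. The sample string and variable names are assumptions made for the example.

```perl
use v5.14;
use warnings;
use Unicode::Normalize qw( NFD NFC );

my $precomposed = "ni\x{F1}o";          # n, i, U+00F1, o
my $decomposed  = NFD($precomposed);    # n, i, n, U+0303, o

my $nfd_length = length $decomposed;
my $same_raw   = $precomposed eq $decomposed      ? 1 : 0;  # compared as-is
my $same_nfc   = NFC($decomposed) eq $precomposed ? 1 : 0;  # compared after NFC

say "NFD length: $nfd_length";
say "equal without normalizing: $same_raw";
say "equal after NFC: $same_nfc";
```
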
String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c &cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPT=-MUnicode::Collate. You can cache the key for binary comparisons.
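
A minimal sketch of the difference between the built-in sort and Unicode::Collate follows. The word list and the ascii_safe helper are hypothetical, added only so the output stays readable on any terminal.

```perl
use v5.14;
use warnings;
use Unicode::Collate;

# Hypothetical word list; "\x{E9}" is e-acute, escaped to keep this file ASCII.
my @words = ("zebra", "\x{E9}clair", "eclipse", "fig");

my @by_code_point = sort @words;                          # built-in sort
my @by_uca        = Unicode::Collate->new->sort(@words);  # UCA order

# Escape non-ASCII characters so the output is plain ASCII.
sub ascii_safe {
    return join ", ",
        map { s/([^\x00-\x7F])/sprintf("\\x{%X}", ord $1)/ger } @_;
}

say "code point order: ", ascii_safe(@by_code_point);
say "UCA order:        ", ascii_safe(@by_uca);
```

The built-in sort pushes the accented word past "zebra" because it compares code points; the collator places it next to the other words starting with "e".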
🐪 built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function because 🐪’s built-in atoi(3) isn’t currently clever enough.
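
The point about \d+ captures can be sketched as below, assuming perl 5.14 or later (where Unicode::UCD provides num). The sample digits are an assumption chosen for the example.

```perl
use v5.14;
use warnings;
use Unicode::UCD qw( num );

# ARABIC-INDIC DIGIT FOUR followed by ARABIC-INDIC DIGIT TWO
my $arabic_indic = "\x{664}\x{662}";

# \d happily matches these digits ...
my ($digits) = $arabic_indic =~ /(\d+)/;

# ... but ordinary numeric conversion only understands ASCII digits.
my $naive = do { no warnings "numeric"; $digits + 0 };

# num() maps a run of same-script digits to its numeric value.
my $value = num($digits);

say "naive conversion: $naive";
say "via num():        $value";
```
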
You are going to have filesystem issues on 🐽 filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
All your 🐪 code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
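
To see why the general-category properties and the derived properties disagree, here is a small sketch using two code points picked for the illustration (they are my choice, not the answer's).

```perl
use v5.14;
use warnings;

my $roman_one  = "\x{2160}";   # ROMAN NUMERAL ONE: a letter-number (Nl)
my $modifier_a = "\x{1D43}";   # MODIFIER LETTER SMALL A: a modifier letter (Lm)

my %is = (
    roman_Lu         => ($roman_one  =~ /\p{Lu}/         ? 1 : 0),
    roman_Upper      => ($roman_one  =~ /\p{Upper}/      ? 1 : 0),
    roman_Letter     => ($roman_one  =~ /\p{Letter}/     ? 1 : 0),
    roman_Alphabetic => ($roman_one  =~ /\p{Alphabetic}/ ? 1 : 0),
    modifier_Ll      => ($modifier_a =~ /\p{Ll}/         ? 1 : 0),
    modifier_Lower   => ($modifier_a =~ /\p{Lower}/      ? 1 : 0),
);

say "$_ => $is{$_}" for sort keys %is;
```

The Roman numeral is uppercase and alphabetic without being a letter at all, and the modifier letter is lowercase without being a lowercase letter.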
If you are looking for 🐪 variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn’t thinking about the punctuation variables or package variables.
If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
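
A quick sketch of the difference between \n and \R, using a made-up string that mixes three kinds of line boundary:

```perl
use v5.14;
use warnings;

# CRLF, then U+2028 LINE SEPARATOR, then a bare LF
my $text = "one\r\ntwo\x{2028}three\nfour";

my @by_newline = split /\n/, $text;   # only sees the two LF characters
my @by_R       = split /\R/, $text;   # sees every linebreak sequence

say "split on \\n: ", scalar(@by_newline), " pieces";
say "split on \\R: ", scalar(@by_R), " pieces";
```

Splitting on \n also leaves a stray carriage return on the first piece, while \R treats the CRLF pair as one boundary.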
If you don’t know when and whether to call Unicode::Stringprep, then you had better learn.
Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the 🐪 built-ins.
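
A minimal sketch of a level-1 comparison, with two sample strings that I picked for the example:

```perl
use v5.14;
use warnings;
use Unicode::Collate;

my $plain    = "resume";
my $accented = "R\x{E9}sum\x{E9}";   # capital R, e-acute twice

my $loose  = Unicode::Collate->new(level => 1);  # primary strength only
my $strict = Unicode::Collate->new;              # default, full strength

my $lc_equal     = lc($plain) eq lc($accented)     ? 1 : 0;
my $loose_equal  = $loose->eq($plain, $accented)   ? 1 : 0;
my $strict_equal = $strict->eq($plain, $accented)  ? 1 : 0;

say "lc() comparison:   $lc_equal";
say "level 1 collation: $loose_equal";
say "full collation:    $strict_equal";
```

Level 1 looks only at base letters, so both case and accents are ignored; lowercasing alone removes only the case difference.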
Sometimes that’s still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate::->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don’t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It’s tough, I tell you. You can play with ucsort to test some of these things out.
Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form (which you had darned well better have remembered to put it in) becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD, you have to rely on UCA.
And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their 🐪 code will be broken.
Code that assumes it can open a text file without specifying the encoding is broken.
Code that assumes the default encoding is some sort of native platform encoding is broken.
Code that assumes that web pages in Japanese or Chinese take up less space in UTF-16 than in UTF-8 is wrong.
Code that assumes Perl uses UTF-8 internally is wrong.
Code that assumes that encoding errors will always raise an exception is wrong.
Code that assumes Perl code points are limited to 0x10_FFFF is wrong.
Code that assumes you can set $/ to something that will work with any valid line separator is wrong.
Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those.
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
Code that assumes changing the case doesn’t change the length of the string is broken.
Code that assumes there are only two cases is broken. There’s also titlecase.
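
Three of the case assumptions above (round-tripping, length, and the existence of only two cases) can be checked with a sketch like this one. The code points are written as escapes, and the variable names are invented for the example.

```perl
use v5.14;
use warnings;

my $final_sigma = "\x{3C2}";   # GREEK SMALL LETTER FINAL SIGMA
my $sharp_s     = "\x{DF}";    # LATIN SMALL LETTER SHARP S
my $dz          = "\x{1C6}";   # LATIN SMALL LETTER DZ WITH CARON

# Round trip: both sigmas uppercase to U+03A3, which lowercases to U+03C3.
my $round_trip_ok = lc(uc($final_sigma)) eq $final_sigma ? 1 : 0;

# Length: uppercasing sharp s yields two characters.
my $upper_length = length uc $sharp_s;

# Titlecase: ucfirst and uc give different results for this digraph.
my $upper_dz = sprintf "U+%04X", ord uc $dz;
my $title_dz = sprintf "U+%04X", ord ucfirst $dz;

say "final sigma survives lc(uc()): $round_trip_ok";
say "length of uc(sharp s): $upper_length";
say "uc gives $upper_dz, ucfirst gives $title_dz";
```
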
Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
Code that assumes that case is never locale-dependent is broken.
Code that assumes Unicode gives a fig about POSIX locales is broken.
Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.
Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.
Code that assumes dashes, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.
Code that assumes every code point takes up no more than one print column is broken.
Code that assumes that all \p{Mark} characters take up zero print columns is broken.
Code that assumes that characters which look alike are alike is broken.
Code that assumes that characters which do not look alike are not alike is broken.
Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.
Code that assumes \X can never start with a \p{Mark} character is wrong.
Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.
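
The three \X assumptions can be probed with a short sketch. The test strings below are my own, chosen as one possible counterexample for each claim.

```perl
use v5.14;
use warnings;

# One base letter followed by fifty combining acute accents
my $stacked = "e" . ("\x{301}" x 50);

my $stack_is_one_cluster = $stacked  =~ /\A\X\z/ ? 1 : 0;  # no length limit
my $lone_mark_matches    = "\x{301}" =~ /\A\X\z/ ? 1 : 0;  # starts with a mark
my $crlf_is_one_cluster  = "\r\n"    =~ /\A\X\z/ ? 1 : 0;  # two non-marks

say "51 code points in one \\X: $stack_is_one_cluster";
say "\\X matching a lone mark:  $lone_mark_matches";
say "CRLF as a single \\X:      $crlf_is_one_cluster";
```
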
Code that assumes that it cannot use "\x{FFFF}" is wrong.
Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point.
Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.
Code that assumes that CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.
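
A rough sketch with the core Encode module shows both points: how a non-BMP code point is encoded, and how the overlong "\xC0\x80" sequence is treated. The variable names are illustrative, and the camel code point is just a convenient non-BMP example.

```perl
use v5.14;
use warnings;
use Encode qw( encode decode );

my $camel = "\x{1F42A}";   # one non-BMP code point

my $utf8_octets  = length encode("UTF-8",    $camel);  # one 4-octet sequence
my $utf16_octets = length encode("UTF-16BE", $camel);  # two 16-bit code units

# The overlong two-octet spelling of U+0000 is not valid UTF-8.
my $overlong = "\xC0\x80";

# Strict decoding with FB_CROAK raises an exception ...
my $strict_ok = eval {
    decode("UTF-8", $overlong, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} ? 1 : 0;

# ... but the default check mode quietly substitutes U+FFFD instead.
my $lenient = decode("UTF-8", $overlong);
my $has_replacement = $lenient =~ /\x{FFFD}/ ? 1 : 0;

say "UTF-8 octets: $utf8_octets, UTF-16 octets: $utf16_octets";
say "strict decode succeeded: $strict_ok";
say "lenient decode substituted U+FFFD: $has_replacement";
```

The last two lines are also a reminder that encoding errors raise an exception only when you ask for that behaviour.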
Code that assumes characters like > always point to the right and < always point to the left is wrong, because they in fact do not.
Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.
Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct taped.)
Code that assumes that all \p{Math} code points are visible characters is wrong.
Code that assumes \w contains only letters, digits, and underscores is wrong.
Code that assumes that ^ and ~ are punctuation marks is wrong.
Code that assumes that ü has an umlaut is wrong.
Code that believes things like ₨ contain any letters in them is wrong.
Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.
Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. I’m not even positive they should even be allowed to see again, since it obviously hasn’t done them much good so far.
Code that believes there’s some way to pretend textfile encodings don’t exist is broken and dangerous. Might as well poke the other eye out, too.
Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.
Code that believes you can use 🐪 printf widths to pad and justify Unicode data is broken and wrong.
Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.
Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You’d be surprised.
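
Two candidates for the surprise, sketched with escapes. The choice of the long s and the Kelvin sign as examples is mine, not the answer's.

```perl
use v5.14;
use warnings;

my $long_s = "\x{17F}";    # LATIN SMALL LETTER LONG S
my $kelvin = "\x{212A}";   # KELVIN SIGN

my $long_s_matches_s_i = $long_s =~ /s/i ? 1 : 0;  # case-insensitive
my $long_s_matches_s   = $long_s =~ /s/  ? 1 : 0;  # case-sensitive
my $kelvin_matches_k_i = $kelvin =~ /k/i ? 1 : 0;

say "long s matches /s/i: $long_s_matches_s_i";
say "long s matches /s/:  $long_s_matches_s";
say "Kelvin matches /k/i: $kelvin_matches_k_i";
```
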
Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.
People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.
I don’t know how much more “default Unicode in 🐪” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.
As you see, there are far too many Unicode things that you really do have to worry about for there to ever exist any such thing as “default to Unicode”.
What you’re going to discover, just as we did back in 🐪 5.8, is that it is simply impossible to impose all these things on code that hasn’t been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.
And even once you do, there are still critical issues that require a great deal of thought to get right. There is no switch you can flip. Nothing but brain, and I mean real brain, will suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21ˢᵗ century, and you cannot wish Unicode away by willful ignorance.
You have to learn it. Period. It will never be so easy that “everything just works,” because that will guarantee that a lot of things don’t work, which invalidates the assumption that there can ever be a way to “make it all work.”
You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.
As just one example, canonical ordering is going to cause some real headaches. 😭 "\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for. 😣
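
To show where the headache comes from, the sketch below runs those four strings through NFD and tries two patterns. The second, looser pattern is only one possible workaround that I am assuming for illustration; it is not a recommendation from the answer.

```perl
use v5.14;
use warnings;
use Unicode::Normalize qw( NFD );

my @forms = (
    "\x{F5}",              # precomposed o with tilde
    "o\x{303}",            # o, combining tilde
    "o\x{303}\x{304}",     # o, tilde, macron
    "o\x{304}\x{303}",     # o, macron, tilde
);

# Naive check: after NFD, does the string start with "o" plus a tilde?
my @naive = map { NFD($_) =~ /\Ao\x{303}/ ? 1 : 0 } @forms;

# Looser check (an assumption for illustration): allow other marks
# to sit between the base letter and the tilde.
my @loose = map { NFD($_) =~ /\Ao\pM*?\x{303}/ ? 1 : 0 } @forms;

say "naive: @naive";
say "loose: @loose";
```

The macron and the tilde have the same combining class, so normalization does not reorder them, and the naive pattern misses the last form.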
If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: ⸗ ᴛʜᴇʀᴇ ɪs ɴᴏ Uɴɪᴄᴏᴅᴇ ᴍᴀɢɪᴄ ʙᴜʟʟᴇᴛ ⸗ 😱
You cannot just change some defaults and get smooth sailing. It’s true that I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.