
Checklist for going the Unicode way with Perl

I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.

Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only occur on rare occasions.

This is what I have done so far (forgive me for only including "summary" code examples):

  1. Made sure files are always read and written in UTF-8:

    use open ':utf8';
    
  2. Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):

    s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg;    # Kept from existing code
    s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg;     # Added
    utf8::decode $_;
    
  3. Made sure text is printed as UTF-8:

    binmode STDOUT, ':utf8';
    
  4. Made sure browsers interpret my content as UTF-8:

    Content-Type: text/html; charset=UTF-8
    <meta http-equiv="content-type" content="text/html;charset=UTF-8">
    
  5. Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):

    accept-charset="UTF-8"
    
  6. Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:

    use utf8;
    

Does this look reasonable, or am I missing something?

EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.

W3Coder asked Sep 17 '10 13:09



2 Answers

  • The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).

  • For the same reason, always Encode::decode('UTF-8', …), not Encode::decode_utf8(…).

  • Make decoding fail hard with an exception, compare:

    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
    
  • You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Step 2 is written correctly as:

    use Encode qw(decode);
    use URI::Escape::XS qw(decodeURIComponent);
    $_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
    
  • Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.

  • Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs the code in case someone later adds non-ASCII string literals. See also CodeLayout::RequireUseUTF8.

  • Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether this is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have earlier had a decoding step. The documents from http://p3rl.org/UNI give the advice to immediately decode after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, 3 argument form of open).

  • For debugging: If you are not sure which string is in a variable, and in which representation, at a certain time, you cannot just print it. Use the tools Devel::StringInfo and Devel::Peek instead.
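To make the surrogate-pair point concrete, here is a minimal pure-Perl sketch of what the decoding in step 2 has to do. The names decode_form_value, _unit_to_bytes and _pair_to_bytes are hypothetical, and the sketch assumes the value arrives as raw bytes. It handles all escapes in a single pass, so text produced by one escape is never unescaped a second time, and it finishes with the strict, croaking decode recommended above. In production, decodeURIComponent from URI::Escape::XS is probably the less error-prone route.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes for one %uXXXX unit that is not part of a pair.
sub _unit_to_bytes {
    my ($hex) = @_;
    my $cp = hex $hex;
    die "Lone surrogate %u$hex in input\n" if $cp >= 0xD800 && $cp <= 0xDFFF;
    return encode('UTF-8', chr $cp);
}

# Hypothetical helper: combine a high and a low surrogate into one code point.
sub _pair_to_bytes {
    my ($hi, $lo) = @_;
    my $cp = 0x10000 + ((hex($hi) - 0xD800) << 10) + (hex($lo) - 0xDC00);
    return encode('UTF-8', chr $cp);
}

# Assumption: $value is a raw byte string taken from the request.
sub decode_form_value {
    my ($value) = @_;

    # One pass over all escapes; every branch yields bytes.
    $value =~ s{
        %u(D[89AB][0-9A-F]{2})%u(D[C-F][0-9A-F]{2})   # surrogate pair
      | %u([0-9A-F]{4})                               # single UTF-16 unit
      | %([0-9A-F]{2})                                # plain byte
    }{
        defined $1 ? _pair_to_bytes($1, $2)
      : defined $3 ? _unit_to_bytes($3)
      :              chr hex $4
    }egix;

    # Strict decode; croaks on malformed input. LEAVE_SRC keeps $value intact.
    return decode('UTF-8', $value, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}
```

Wrapping the call in eval lets the CGI script reject a malformed request instead of storing garbage, for example `my $text = eval { decode_form_value($raw) };` followed by a check for undef.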

daxim answered Sep 28 '22 18:09


You're always missing something. The problem is usually the unknown unknowns, though. :)

Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover, though, was everything you have to do to ensure that your database server and web server do the right thing.

Some other things you'll need to do:

  • Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.

  • Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode replacement character, U+FFFD (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �

  • Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).

  • Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
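For the content-conversion item, and the one-time batch mentioned in the question, a rough sketch of converting a single flat file might look like the following. The name convert_latin1_file is hypothetical. The "already UTF-8" check is only a heuristic I am assuming is good enough here: it keeps a second run from double-encoding a file, but Latin-1 text whose high bytes happen to form valid UTF-8 would be skipped. Back up the data before running anything like this.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: rewrite one file from ISO-8859-1 to UTF-8.
# Returns 1 if the file was converted, 0 if it was left alone.
sub convert_latin1_file {
    my ($path) = @_;

    # Read the raw bytes with no decoding layer.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    $bytes = '' unless defined $bytes;

    # Heuristic (assumption): bytes that pass a strict UTF-8 decode are
    # treated as already converted. Pure ASCII files also land here.
    my $is_utf8 = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return 0 if $is_utf8;

    # Every byte is a valid ISO-8859-1 character, so this decode cannot fail.
    my $text = decode('ISO-8859-1', $bytes);

    # Write to a temporary file first, then replace the original.
    my $tmp = "$path.tmp";
    open my $out, '>:raw', $tmp or die "Cannot write $tmp: $!";
    print {$out} encode('UTF-8', $text);
    close $out or die "Cannot close $tmp: $!";
    rename $tmp, $path or die "Cannot replace $path: $!";
    return 1;
}
```

The batch itself would then be a loop over the data files, for example with File::Find, calling convert_latin1_file on each path and logging the return value.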

If you find anything else, please let us know by answering your own question with whatever we left out. ;)

brian d foy answered Sep 28 '22 18:09