Checklist for going the Unicode way with Perl

Tags: unicode, utf-8, perl

I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.

Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only occur on rare occasions.

This is what I have done so far (forgive me for only including "summary" code examples):

Made sure files are always read and written in UTF-8:
```
use open ':utf8';
```

Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):

```
s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg;    # Kept from existing code
s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg;     # Added
utf8::decode $_;
```

Made sure text is printed as UTF-8:
```
binmode STDOUT, ':utf8';
```

Made sure browsers interpret my content as UTF-8:

```
Content-Type: text/html; charset=UTF-8
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
```

Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
```
accept-charset="UTF-8"
```
Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:
```
use utf8;
```

Does this look reasonable or am I missing something?

EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
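
For the one-time batch, a minimal sketch might look like the following. It assumes every existing data file really is ISO-8859-1 (a file that is already UTF-8 would get double-encoded), and the helper names `latin1_bytes_to_utf8_bytes` and `convert_file` are hypothetical, not part of the site's code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn ISO-8859-1 bytes into UTF-8 bytes.
sub latin1_bytes_to_utf8_bytes {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-1', $bytes);    # bytes -> characters
    return encode('UTF-8', $text);              # characters -> UTF-8 bytes
}

# Hypothetical helper: convert one file, writing to a separate path so
# that the original stays available as a backup.
sub convert_file {
    my ($in_path, $out_path) = @_;

    # Read and write raw bytes; the conversion itself is done explicitly.
    open my $in, '<:raw', $in_path or die "Cannot read $in_path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    $bytes = '' unless defined $bytes;

    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    print {$out} latin1_bytes_to_utf8_bytes($bytes);
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

A call such as `convert_file('board.dat', 'board.dat.utf8')` (file names made up for illustration) would then be run once per data file.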


asked Sep 17 '10 13:09

W3Coder

2 Answers

The `:utf8` PerlIO layer is not strict enough. It accepts any input that is structurally well-formed as UTF-8 byte sequences, but for good security you want to reject input that is not actually valid Unicode. Replace it everywhere with the `PerlIO::encoding` layer, thus: `:encoding(UTF-8)`.

For the same reason, always use `Encode::decode('UTF-8', …)`, not `Encode::decode_utf8(…)`.

Make decoding fail hard with an exception, compare:

```
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
```
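
In a CGI script you will probably want to catch that exception rather than let the request die. A minimal sketch, assuming a hypothetical helper named `decode_utf8_strict` that returns `undef` for invalid input:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded text, or undef if the bytes
# are not valid UTF-8. decode() may modify its input when a CHECK value
# is given, but $bytes is a local copy, so the caller's data is safe.
sub decode_utf8_strict {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $text;    # undef when decode() croaked
}

for my $input ("caf\xC3\xA9", "\xC0") {
    my $text = decode_utf8_strict($input);
    print defined $text ? "valid\n" : "rejected\n";
}
```

What to do with rejected input (an error page, a log entry) is up to the application.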

You are not taking care of surrogate pairs in the `%u` notation. This is the only major bug I can see in your list. Item 2 is written correctly as:
```
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
```
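
To show what "taking care of surrogate pairs" means, here is a rough pure-Perl sketch of the pairing arithmetic. The function name `decode_percent_u` is made up for illustration. It only handles `%uXXXX` escapes in an otherwise ASCII string and is not a replacement for the module above.

```perl
use strict;
use warnings;

# Hypothetical helper: expand %uXXXX escapes, combining surrogate pairs.
sub decode_percent_u {
    my ($string) = @_;

    # A high surrogate (D800-DBFF) directly followed by a low surrogate
    # (DC00-DFFF) encodes a single code point above U+FFFF.
    $string =~ s{%u(D[89AB][0-9A-F]{2})%u(D[C-F][0-9A-F]{2})}{
        chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
    }gei;

    # Any other %uXXXX is one code point; a leftover surrogate is an error.
    $string =~ s{%u([0-9A-F]{4})}{
        my $code_point = hex $1;
        die sprintf("Unpaired surrogate %%u%04X\n", $code_point)
            if $code_point >= 0xD800 && $code_point <= 0xDFFF;
        chr $code_point;
    }gei;

    return $string;
}
```

The question's one-line substitution turns each half of such a pair into its own character, which is why characters outside the Basic Multilingual Plane come out wrong.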
Do not mess around with the functions from the `utf8` module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the `Encode` module.

Add the `utf8` pragma anyway in every module. It cannot hurt, and it future-proofs the code in case someone later adds non-ASCII string literals. See also the Perl::Critic policy `CodeLayout::RequireUseUTF8`.

Employ `encoding::warnings` to smoke out remaining implicit upgrades. Verify for each case whether the upgrade is intended/needed. If yes, convert it to an explicit upgrade with `Unicode::Semantics`. If not, this is a hint that you should have had a decoding step earlier.

The documents from http://p3rl.org/UNI advise decoding immediately after receiving data from its source. Go over the places where the code reads or writes data and verify that you have a decoding/encoding step, either explicitly (`decode('UTF-8', …)`) or implicitly through a layer (`use open` pragma, `binmode`, 3-argument form of `open`).

For debugging: if you are not sure what string a variable holds, and in which representation, at a certain time, you cannot just `print` it. Use the tools `Devel::StringInfo` and `Devel::Peek` instead.
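
As a small illustration of that last point, the sketch below compares a string before and after decoding. The helper `describe_string` is hypothetical. It lists the code points Perl sees, while `Dump` from the core module `Devel::Peek` prints the internal representation to STDERR.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

# Hypothetical helper: report the length and code points of a string.
sub describe_string {
    my ($label, $string) = @_;
    my $code_points = join ' ', map { sprintf '%04X', ord } split //, $string;
    return sprintf '%s: length=%d [%s]', $label, length($string), $code_points;
}

my $bytes = "caf\xC3\xA9";              # UTF-8 encoded bytes, not yet decoded
my $text  = decode('UTF-8', $bytes);    # character string

print describe_string('undecoded', $bytes), "\n";
print describe_string('decoded',   $text),  "\n";

# Writes flags and the internal buffer of the scalar to STDERR.
Dump($text);
```

An undecoded string shows two separate values (C3 and A9) where the decoded one shows a single E9, which is usually enough to tell whether a decoding step was missed.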


answered Sep 28 '22 18:09

daxim

You're always missing something. The problem is usually the unknown unknowns, though. :)

Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover, though, was everything you have to do to ensure that your database server and web server do the right thing.

Some other things you'll need to do:

Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.

Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode replacement character, U+FFFD (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �

Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).

Ensure that anything looking at `@ARGV` translates the items from the locale of the command line to UTF-8 (it's in the book).
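
For the `@ARGV` point, one possible sketch (not necessarily the approach from the book) asks the locale for its encoding through the core module `I18N::Langinfo` and decodes each argument with it. The helper name `decode_args` is made up, and the code assumes that `Encode` recognizes the codeset name the locale reports.

```perl
use strict;
use warnings;
use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: decode a list of byte strings from a named encoding.
sub decode_args {
    my ($codeset, @args) = @_;
    return map { decode($codeset, $_) } @args;
}

# langinfo(CODESET) reports the encoding name of the current locale.
my @args = decode_args(langinfo(CODESET), @ARGV);
```

After this step the arguments are character strings like any other decoded input, and they get encoded again on output by whatever layer is set on the filehandle.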

If you find anything else, please let us know by answering your own question with whatever we left out. ;)

answered Sep 28 '22 18:09

brian d foy
