It is possible to write Perl documentation in UTF-8. To do it you should write in your POD:
=encoding NNN
But what should you write in place of NNN? Different sources give different answers:
=encoding utf8
=encoding UTF-8
=encoding utf-8
What is the correct answer? What is the correct string to be written in POD?
$octets = encode(ENCODING, $string [, CHECK]) Encodes a string from Perl's internal form into ENCODING and returns a sequence of octets. ENCODING can be either a canonical name or an alias. For encoding names and aliases, see Defining Aliases. For CHECK, see Handling Malformed Data.
UTF-8 is a valid IANA character set name, whereas utf8 is not. It's not even a valid alias. It refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.
UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names. In UTF-8, the smallest binary representation of a character is one byte, or eight bits.
UTF-8 is a character encoding system. It is backward compatible with ASCII: ASCII characters are encoded as single bytes, while international characters, such as Chinese characters, are encoded as multi-byte sequences.
=encoding UTF-8
According to IANA, charset names are case-insensitive, so utf-8 is the same.
utf8 is Perl's lax variant of UTF-8. However, for safety, you want to be strict with your POD processors.
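One way to check this for yourself is to ask the Encode module how it resolves each spelling. This is only a minimal sketch: it assumes your POD processor hands the =encoding argument to Encode, which may not hold for every tool. The %canonical hash is just an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each spelling from the question to the canonical name
# that Encode resolves it to.
my %canonical;
for my $name ('UTF-8', 'utf-8', 'utf8') {
    my $enc = find_encoding($name)
        or die "Encode does not recognize '$name'";
    $canonical{$name} = $enc->name;
    printf "%-6s => %s\n", $name, $canonical{$name};
}
```

On a stock Encode, UTF-8 and utf-8 should both resolve to utf-8-strict, while utf8 resolves to utf8, the lax variant.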
As daxim points out, I have been misled. =encoding UTF-8 and =encoding utf-8 apply the strict encoding, and =encoding utf8 is the lenient encoding:
$ cat enc-test.pod
=encoding ENCNAME
=head1 TEST '\344\273\245\376\202\200\200\200\200\200'
=cut
(Here \xxx means the literal byte with octal value xxx. \344\273\245 is a valid UTF-8 sequence; \376\202\200\200\200\200\200 is not.)
=encoding utf-8:
$ perl -pe 's/ENCNAME/utf-8/' enc-test.pod | pod2cpanhtml | grep /h1
>TEST '以此�'</a></h1>
=encoding utf8:
$ perl -pe 's/ENCNAME/utf8/' enc-test.pod | pod2cpanhtml | grep /h1
Code point 0x80000000 is not Unicode, no properties match it; ...
Code point 0x80000000 is not Unicode, no properties match it; ...
Code point 0x80000000 is not Unicode, no properties match it; ...
>TEST '以�'</a></h1>
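If pod2cpanhtml is not at hand, the strict half of this experiment can be reproduced with Encode alone. The sketch below feeds the same two byte sequences to the strict UTF-8 decoder with FB_CROAK, so malformed input raises an error instead of being replaced. The helper name strict_ok is made up for illustration, and the sketch deliberately does not show what the lax utf8 decoder does, since that has varied between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# The same bytes as in the =head1 line of enc-test.pod (octal escapes).
my $valid   = "\344\273\245";                  # valid UTF-8 sequence
my $invalid = "\376\202\200\200\200\200\200";  # not valid UTF-8

# Return 1 if the strict UTF-8 decoder accepts the bytes, 0 otherwise.
# decode() may modify its input when CHECK is set, so the argument is
# copied into $bytes first.
sub strict_ok {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

printf "valid:   %s\n", strict_ok($valid)   ? 'accepted' : 'rejected';
printf "invalid: %s\n", strict_ok($invalid) ? 'accepted' : 'rejected';
```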
They are all equivalent. The argument to =encoding is expected to be a name recognized by the Encode::Supported module. When you drill down into that document, you see utf8 listed, with UTF-8 as an alias for utf8, and utf-8 equivalent to UTF-8.
What's the best practice? I'm not sure. I don't think you can go wrong using the official IANA name (as per daxim's answer), but you can't go wrong following the official Perl documentation, either.
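For reference, here is what the IANA spelling looks like in a complete file. This is only a sketch: the package name My::Example and its description are placeholders, and the file itself is assumed to be saved as UTF-8.

```perl
package My::Example;    # hypothetical module name
use strict;
use warnings;

1;

__END__

=encoding UTF-8

=head1 NAME

My::Example - ein Beispiel mit Umlauten: ä ö ü

=head1 DESCRIPTION

The =encoding line comes before any non-ASCII text in the POD.

=cut
```

Running podchecker on the file is a quick way to see whether your toolchain accepts the declaration.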