
What string should be used to specify encoding in Perl POD, "utf8", "UTF-8" or "utf-8"?

It is possible to write Perl documentation in UTF-8. To do it you should write in your POD:

=encoding NNN

But what should you write in place of NNN? Different sources give different answers.

  • perlpod says that it should be =encoding utf8
  • this Stack Overflow answer states that it should be =encoding UTF-8
  • and this answer tells me to write =encoding utf-8

What is the correct answer? What is the correct string to be written in POD?

bessarabov asked Aug 07 '13 16:08



2 Answers

=encoding UTF-8

According to IANA, charset names are case-insensitive, so utf-8 is the same.

utf8 is Perl's lax variant of UTF-8. For safety, however, you want your POD processors to apply the strict encoding.
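The distinction can be checked directly against the Encode module. The following is a minimal sketch, not part of the original answer: it asks Encode which canonical encoding each of the three spellings resolves to. The hash name %canonical is only illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each spelling from the question to the canonical name Encode uses.
my %canonical;
for my $label ('utf8', 'UTF-8', 'utf-8') {
    my $enc = find_encoding($label)
        or die "Encode does not know '$label'\n";
    $canonical{$label} = $enc->name;
    printf "%-6s resolves to %s\n", $label, $canonical{$label};
}
```

In Encode, the hyphenated spellings resolve to the strict encoding (named utf-8-strict), while utf8 resolves to the lax one.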

daxim answered Sep 21 '22 13:09


As daxim points out, I have been misled. =encoding UTF-8 and =encoding utf-8 apply the strict encoding, and =encoding utf8 is the lenient encoding:

$ cat enc-test.pod
=encoding ENCNAME

=head1 TEST '\344\273\245\376\202\200\200\200\200\200'

=cut

(Here \xxx means the literal byte with octal value xxx. \344\273\245 is a valid UTF-8 sequence; \376\202\200\200\200\200\200 is not.)
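The same bytes can also be fed straight to Encode, without going through a POD processor. This is a rough sketch under that assumption: it decodes the byte string exactly as listed in the =head1 line with both encoding names and prints the resulting code points. The variable names are illustrative, and the result for the lax utf8 encoding may differ between Perl and Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes from the =head1 line of enc-test.pod, written as octal escapes.
my $bytes = "\344\273\245\376\202\200\200\200\200\200";

# Decode with the default CHECK of 0, so malformed input is substituted
# instead of raising an exception.
my %code_points;
for my $label ('UTF-8', 'utf8') {
    my $copy = $bytes;    # decode a copy so $bytes is never altered
    my $text = decode($label, $copy);
    $code_points{$label} = [ map { ord } split //, $text ];
    printf "%-5s => %s\n", $label,
        join ' ', map { sprintf 'U+%04X', $_ } @{ $code_points{$label} };
}
```

With the strict encoding, the invalid tail should come back as one or more U+FFFD replacement characters after U+4EE5.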

=encoding utf-8:

$ perl -pe 's/ENCNAME/utf-8/' enc-test.pod | pod2cpanhtml | grep /h1
>TEST &#39;&#20197;&#27492;&#65533;&#39;</a></h1>

=encoding utf8:

$ perl -pe 's/ENCNAME/utf8/' enc-test.pod | pod2cpanhtml | grep /h1
Code point 0x80000000 is not Unicode, no properties match it; ...
Code point 0x80000000 is not Unicode, no properties match it; ...
Code point 0x80000000 is not Unicode, no properties match it; ...
>TEST &#39;&#20197;&#2147483648;&#39;</a></h1>

My original answer, which the test above contradicts, was that they are all equivalent. The argument to =encoding is expected to be a name recognized by the Encode::Supported module. When you drill down into that document, you see

  • the canonical encoding name is utf8
  • the name UTF-8 is an alias for utf8, and
  • names are case insensitive, so utf-8 is equivalent to UTF-8

What's the best practice? I'm not sure. I don't think you can go wrong using the official IANA name (as per daxim's answer), but you can't go wrong following the official Perl documentation, either.

mob answered Sep 17 '22 13:09