Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I make Perl code DWIM with UTF8?

Tags:

utf-8

perl

The utf8 pragma and utf8 encodings on filehandles have me confused. For example, this apparently straightforward code...

use utf8;
print qq[fü];

To be clear, the hex dump on "fü" is 66 c3 bc which if I'm not mistaken is proper UTF8.

That prints 66 fc which is not UTF8 but Unicode or maybe Latin-1. Turn off use utf8 and I get 66 c3 bc. This is the opposite of what I'd expect.

Now let's add in filehandle pramgas.

use utf8;
binmode *STDOUT, ':encoding(utf8)';
print qq[fü];

Now I get 66 c3 bc. But remove use utf8 and I get 66 c3 83 c2 bc which doesn't make any sense to me.

What's the right thing to do to make my code DWIM with UTF8?

PS My locale is set to "en_US.UTF-8" and Perl 5.10.1.

like image 244
Schwern Avatar asked Feb 17 '10 11:02

Schwern


People also ask

Does Perl support Unicode?

While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. Also, the use of Unicode may present security issues that aren't obvious, see "Security Implications of Unicode" below.

How do I encode in Perl?

$octets = encode(ENCODING, $string [, CHECK]) Encodes a string from Perl's internal form into ENCODING and returns a sequence of octets. ENCODING can be either a canonical name or an alias. For encoding names and aliases, see Defining Aliases. For CHECK, see Handling Malformed Data.

What UTF-8 means?

UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.

What are UTF-8 strings?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”


1 Answers

use utf8; states that your source code is encoded in UTF8. By adding

binmode *STDOUT, ':encoding(utf8)';
print qq[fü];

you are asking that the script's output be encoded in UTF8 as well.

If you had written

print "f\x{00FC}\n";

you would not have needed use utf8;.

like image 167
Sinan Ünür Avatar answered Sep 24 '22 04:09

Sinan Ünür