Perl ord and chr working with unicode

Tags: unicode, perl

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear:

Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.

Indeed I can print a smiley using

perl -e 'print chr(0x263a)'

but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.

I've tried funny things like use utf8 and use encoding 'utf8'. I haven't tried funny things like use v5.12 and use feature 'unicode_strings', as they don't work with my version. I fooled around with Encode::decode, only to find out that I need no decoding, as I have no byte array to decode. I've read much more documentation than ever before and found quite a few interesting things, but nothing helpful. It looks like some form of "The Unicode Bug", but no usable solution is given. Moreover, I don't care about whole-string semantics; all I need is a trivial function.

So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?


The first answer I got explains nearly everything about I/O, but I still don't understand why

#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';

print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";

print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";

prints

ne1 - eq1
match1 - no_match2

It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).

maaartinus asked Sep 05 '12 23:09


1 Answer

First,

perl -le'print chr(0x263A);'

is buggy. Perl even tells you as much:

Wide character in print at -e line 1.

That doesn't qualify as "working". So while they differ in how they fail, neither of the following gives you what you want:

perl -le'print chr(0x263A);'

perl -le'print chr(0x00C0);'

To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the Unicode code points with UTF-8.

$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺

$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À

Now on to the "why".

File handles can only transmit bytes, so unless you tell it otherwise, Perl file handles expect bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters with a value over 255. The output is exactly what you provide:

$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004

This is useful. This is different than what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.

By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.

$ perl -e'
   use open ":std", ":encoding(UTF-8)";
   print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
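
The same contrast can be reproduced inside a single script by printing to in-memory handles instead of piping to od. This is only a minimal sketch: bytes_written and hex_dump are hypothetical helper names introduced for illustration, and it assumes no default I/O layers have been set through PERL_UNICODE or -C.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with
# the given mode and return the bytes that ended up in the buffer.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "cannot close in-memory handle: $!";
    return $buffer;
}

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $text = join '', map { chr } 0x65, 0xC0;

# No :encoding layer: each character goes out as the byte with that value.
print hex_dump(bytes_written('>', $text)), "\n";                   # 65 c0

# :encoding(UTF-8): each code point is encoded, so 0xC0 becomes c3 80.
print hex_dump(bytes_written('>:encoding(UTF-8)', $text)), "\n";   # 65 c3 80
```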

You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.

A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
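
As a small illustration of "a string is just a sequence of numbers", the sketch below builds a string with chr and reads the numbers back with ord. The helper name char_numbers is hypothetical and only there for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: list the number stored in each character of a string.
sub char_numbers {
    my ($string) = @_;
    return map { ord } split //, $string;
}

# Three characters: one below 128, one below 256, one above 255.
my $string = join '', map { chr } 0x41, 0xC0, 0x263A;

print length($string), "\n";                                               # 3
print join(' ', map { sprintf '0x%X', $_ } char_numbers($string)), "\n";   # 0x41 0xC0 0x263A
```

Nothing here decides whether 0xC0 means "À", a raw byte, or something else; that only happens once the string is handed to an operator that assigns a meaning.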

Here are a few examples of operators that do assign meaning to the strings they receive as operands:

  • m// expects a string of Unicode code points.
  • connect expects a sequence of bytes that represent a sockaddr_in structure.
  • print with a handle without :encoding expects a sequence of bytes.
  • print with a handle with :encoding expects a sequence of Unicode code points.
  • etc.

So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?

chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
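
To see the two-character effect without depending on how a file is saved, the sketch below writes the UTF-8 bytes of À with escapes and decodes them with Encode::decode. That decoding step stands in for what use utf8; does, in effect, for literals in the source; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes a UTF-8 encoded source file holds for À, written with
# escapes so the example does not depend on this file's own encoding.
my $undecoded = "\xC3\x80";

# Decoding turns those UTF-8 bytes into a single character.
my $decoded = decode('UTF-8', $undecoded);

print length($undecoded), "\n";                        # 2
print length($decoded), "\n";                          # 1
print $decoded   eq chr(0xC0) ? "eq" : "ne", "\n";     # eq
print $undecoded eq chr(0xC0) ? "eq" : "ne", "\n";     # ne
```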


Regarding the question you've added:

There are problems with the encoding pragma. I recommend against using it. Instead, use

use open ':std', ':encoding(UTF-8)';

That'll fix one of the problems. The other problem you are encountering is with

chr(0x00C0) =~ /\w/

It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:

use 5.014;    # use 5.012; *might* suffice.

A workaround that works as far back as 5.8:

my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/
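
A minimal sketch of what the workaround changes, assuming a script that does not enable the unicode_strings feature: utf8::upgrade only switches the internal storage of the string, so the upgraded copy still compares equal to the original, yet the regex engine now applies Unicode rules to it.

```perl
use strict;
use warnings;

my $native   = chr(0xC0);
my $upgraded = chr(0xC0);
utf8::upgrade($upgraded);    # changes the internal storage, not the characters

# Both variables still hold the same one-character string.
print $native eq $upgraded ? "same string\n" : "different strings\n";   # same string

# Only the internal flag differs.
print utf8::is_utf8($native)   ? 1 : 0, "\n";    # 0
print utf8::is_utf8($upgraded) ? 1 : 0, "\n";    # 1

# The upgraded copy is matched with Unicode rules.
print $upgraded =~ /\w/ ? "match\n" : "no match\n";    # match
```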
ikegami answered Oct 06 '22 01:10