
How can I make Perl 6 be round-trip safe for Unicode data?

Tags: unicode, raku

A naïve Perl 6 program is not round-trip safe with respect to Unicode. It appears to use Normalization Form C (NFC) internally for the Str type:

$ perl -CO -E 'say "e\x{301}"' | perl6 -ne '.say' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+00e9
U+000a
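To make the output above easier to interpret, here is a minimal sketch of what NFC composition does to the same input. It uses Python's standard `unicodedata` module purely as an illustration; it is not how Perl 6 implements Str, and the helper name `code_points` is hypothetical.

```python
import unicodedata


def code_points(text):
    """Return each code point in text formatted like the Perl printf above."""
    return [f"U+{ord(ch):04x}" for ch in text]


# "e" followed by COMBINING ACUTE ACCENT, plus the newline that `say` appends.
decomposed = "e\u0301\n"

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(code_points(decomposed))  # ['U+0065', 'U+0301', 'U+000a']
print(code_points(composed))    # ['U+00e9', 'U+000a']
```

The second line of output matches what the pipeline prints, which is consistent with the text having been composed somewhere between input and output.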

Poking through the docs, I can't find anything about this behavior, and I find it very shocking. I can't believe you have to drop back to the byte level to round-trip text:

$ perl -CO -E 'say "e\x{301}"' | perl6 -e 'while (my $byte = $*IN.read(1)) { $*OUT.write($byte) }' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+0065
U+0301
U+000a

Do all text files have to be in NFC to be safely round-tripped with Perl 6? What if the document is supposed to be in NFD? I must be missing something here. I cannot believe this is intentional behavior.

Chas. Owens asked Sep 23 '16 14:09




2 Answers

The answer seems to be to use the Uni type (the base class for NFD, NFC, etc.), but it doesn't fully work that way yet, and there is currently no good way to read a file into a Uni string. So, until some unspecified point in the future, you cannot round-trip a non-normalized file unless you treat it as bytes.
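One way to see which files are affected is to check whether their bytes would survive a decode, NFC-normalize, encode cycle. The sketch below models that cycle in Python with the standard `unicodedata` module; it is an assumption-laden illustration of the behavior described in the question, not Perl 6's actual I/O path, and `survives_nfc` is a hypothetical helper name.

```python
import unicodedata


def survives_nfc(data):
    """Return True if UTF-8 bytes are unchanged by decode -> NFC -> encode."""
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8") == data


# Decomposed input (NFD): "e" + COMBINING ACUTE ACCENT + newline.
nfd_bytes = "e\u0301\n".encode("utf-8")
# Precomposed input (already NFC): U+00E9 + newline.
nfc_bytes = "\u00e9\n".encode("utf-8")

print(survives_nfc(nfd_bytes))  # False
print(survives_nfc(nfc_bytes))  # True
```

Data for which this check returns False is the kind that, per the answer above, has to be handled as bytes to be copied exactly.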

Chas. Owens answered Oct 16 '22 18:10


Use UTF8-C8. From the documentation:

You can use UTF8-C8 with any file handle to read the exact bytes as they are on disk. They may look funny when printed out, if you print it out using a UTF8 handle. If you print it out to a handle where the output is UTF8-C8, then it will render as you would normally expect, and be a byte for byte exact copy.

Christopher Bottoms answered Oct 16 '22 18:10