A naïve Perl 6 program is not round-trip safe with respect to Unicode. It appears to use Normalization Form C (NFC, canonical composition) internally for the Str type:
$ perl -CO -E 'say "e\x{301}"' | perl6 -ne '.say' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+00e9
U+000a
Poking through the docs I can't see anything about this behavior and I find it very shocking. I can't believe you have to drop back to the byte level to round-trip text:
$ perl -CO -E 'say "e\x{301}"' | perl6 -e 'while (my $byte = $*IN.read(1)) { $*OUT.write($byte) }' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+0065
U+0301
U+000a
Do all text files have to be in NFC to be safely round-tripped with Perl 6? What if the document is supposed to be in NFD? I must be missing something here. I cannot believe this is intentional behavior.
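The effect seen in the first pipeline can be reproduced outside Perl 6. The following is a minimal sketch in Python, used here only as a stand-in to illustrate what NFC does to the input; it says nothing about how Perl 6 implements Str. The helper name `code_points` is hypothetical and exists only for this illustration.

```python
import unicodedata

def code_points(text):
    """Format each code point the way the Perl 5 one-liner prints it."""
    return [f"U+{ord(ch):04x}" for ch in text]

# The input written by the first Perl 5 command: "e" followed by a combining acute accent.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits it back into base letter plus combining mark.
redecomposed = unicodedata.normalize("NFD", composed)

print(code_points(decomposed))    # ['U+0065', 'U+0301']
print(code_points(composed))      # ['U+00e9']
print(code_points(redecomposed))  # ['U+0065', 'U+0301']
```

The two forms are canonically equivalent, but they are different code point sequences, which is why a tool that silently normalizes to NFC does not hand back the input it was given.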
To store a Unicode character in a char variable, declare the variable (char c;) and then assign the character to it.
UTF-8 is a Unicode character encoding. It takes the code point of a given Unicode character and translates it into a sequence of one to four bytes. It also does the reverse, reading bytes and converting them back to code points.
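As a small illustration of that mapping, here is a sketch in Python (chosen only for convenience, not because the question involves it). It encodes the two code points from the question and decodes them again; the variable names are made up for this example.

```python
# U+00E9 (precomposed e-acute) and U+0301 (combining acute accent) each need two bytes in UTF-8.
e_acute = "\u00e9"
combining_acute = "\u0301"

e_acute_bytes = e_acute.encode("utf-8")
combining_bytes = combining_acute.encode("utf-8")

print(e_acute_bytes)                         # b'\xc3\xa9'
print([f"{b:08b}" for b in e_acute_bytes])   # ['11000011', '10101001']
print(combining_bytes)                       # b'\xcc\x81'

# Decoding reverses the process and yields the original code points.
print(e_acute_bytes.decode("utf-8") == e_acute)                  # True
print(combining_bytes.decode("utf-8") == combining_acute)        # True
```

Note that UTF-8 itself does not normalize anything: the decomposed and the precomposed spellings simply encode to different byte sequences.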
Unicode Character Properties. (The only time that Perl considers a sequence of individual code points as a single logical character is in the \X construct, already mentioned above. Therefore "character" in this discussion means a single Unicode code point.)
Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.
The answer seems to be to use the Uni type (the base class for NFD, NFC, etc.), but it does not really work that way yet, and there is no good way to read a file into a Uni string. So, until some unspecified point in the future, you cannot round-trip a non-normalized file unless you treat it as bytes.
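The difference between the two approaches can be sketched as follows. This is Python standing in for the idea, not Perl 6 code, and both function names (`copy_as_bytes`, `copy_as_nfc_text`) are hypothetical. The second function assumes a text layer that normalizes to NFC on input, which is what the question's output suggests.

```python
import unicodedata

def copy_as_bytes(raw: bytes) -> bytes:
    """Byte-level pass-through: nothing is decoded, so nothing can change."""
    return raw

def copy_as_nfc_text(raw: bytes) -> bytes:
    """Text-level pass-through under the assumption that decoding normalizes to NFC."""
    text = unicodedata.normalize("NFC", raw.decode("utf-8"))
    return text.encode("utf-8")

# One line of NFD input, as produced by the Perl 5 command in the question.
nfd_line = "e\u0301\n".encode("utf-8")

print(copy_as_bytes(nfd_line) == nfd_line)     # True
print(copy_as_nfc_text(nfd_line) == nfd_line)  # False
print(copy_as_nfc_text(nfd_line))              # b'\xc3\xa9\n'
```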
Use UTF8-C8. From the documentation:
You can use UTF8-C8 with any file handle to read the exact bytes as they are on disk. They may look funny when printed out, if you print it out using a UTF8 handle. If you print it out to a handle where the output is UTF8-C8, then it will render as you would normally expect, and be a byte for byte exact copy.