I wish to read byte sequences that will not decode as valid UTF-8, specifically byte sequences that correspond to high and low surrogates code points. The result should be a raku string.
I read that, in raku, the 'utf8-c8' encoding can be used for this purpose.
Consider code point U+D83F. It is a high surrogate (reserved for the high half of UTF-16 surrogate pairs).
U+D83F has a byte sequence of 0xED 0xA0 0xBF, if encoded as UTF-8.
If I slurp a file containing this byte sequence, using 'utf8-c8' as the encoding, I get the expected result:
echo -n $'\ud83f' >testfile # Create a test file containing the byte sequence
myprog1.raku:
#!/usr/local/bin/raku
$*OUT.encoding('utf8-c8');
print slurp('testfile', enc => 'utf8-c8');
$ ./myprog1.raku | od -An -tx1
ed a0 bf
✔️ expected result
But if I switch from slurping a file path to slurping a filehandle, it doesn't work, even though I set the filehandle's encoding to 'utf8-c8':
myprog2.raku
#!/usr/local/bin/raku
$*OUT.encoding('utf8-c8');
my $fh = open "testfile", :r, :enc('utf8-c8');
print slurp($fh, enc => 'utf8-c8');
#print $fh.slurp; # I tried this too: same error
$ ./myprog2.raku
Error encoding UTF-8 string: could not encode Unicode Surrogate codepoint 55359 (0xD83F)
in block <unit> at ./myprog2.raku line 4
Edit 2022-10-30: I originally used my distro's package (Fedora Linux 36: Rakudo version 2020.07). I just downloaded the latest Rakudo binary release (2022.07-01). Result was the same.
$ /usr/local/bin/raku --version
Welcome to Rakudo™ v2022.07.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2022.07.
$ uname -a
Linux hx90 5.19.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Oct 16 22:50:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: Fedora
Description: Fedora release 36 (Thirty Six)
Release: 36
Codename: ThirtySix
I can reproduce the same behavior on Rakudo 2025.02 and Ubuntu 24.04.1 LTS.
Just to clarify, in the line
print slurp($fh, enc => 'utf8-c8');
enc doesn't really do anything because ultimately it will just turn into $fh.slurp(enc => 'utf8-c8') which doesn't take such named argument and will just silently drop it.
Other than that, I think it is a Rakudo bug, although I don't know enough NQP to fix it. This is the call for the path version and this is the call for the file handle version, if we skip the trivial redispatches that happen at API level. Long story short, it seems that the file handle version tries to be clever in one go while the path version first reads binary and then decodes that.
I haven't seen a bug report for this particular problem in the rakudo/rakudo repository - maybe you could open one.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With