Why doesn't 'utf8-c8' encoding work when reading filehandles

Question

I wish to read byte sequences that will not decode as valid UTF-8, specifically byte sequences that correspond to surrogate code points (high or low). The result should be a Raku string.

I read that, in Raku, the 'utf8-c8' encoding can be used for this purpose.

Consider code point U+D83F. It is a high surrogate (reserved for the high half of UTF-16 surrogate pairs).

If encoded as UTF-8, U+D83F would have the byte sequence 0xED 0xA0 0xBF.
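As a quick check of that byte sequence, here is a minimal sketch that applies the generic three-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx) to the code point by hand. The variable names `$cp` and `@bytes` are only illustrative. It uses plain integer arithmetic, so no surrogate ever has to live in a string:

```raku
# Hand-encode a code point from the three-byte UTF-8 range (U+0800..U+FFFF).
my $cp = 0xD83F;

my @bytes =
    0xE0 +| ($cp +> 12),            # 1110xxxx: top 4 bits
    0x80 +| (($cp +> 6) +& 0x3F),   # 10xxxxxx: middle 6 bits
    0x80 +| ($cp +& 0x3F);          # 10xxxxxx: low 6 bits

say @bytes.map(*.fmt('%02x')).join(' ');   # ed a0 bf
```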

Slurping a file? Works

If I slurp a file containing this byte sequence, using 'utf8-c8' as the encoding, I get the expected result:

echo -n $'\ud83f' >testfile # Create a test file containing the byte sequence

myprog1.raku:

#!/usr/local/bin/raku
$*OUT.encoding('utf8-c8');
print slurp('testfile', enc => 'utf8-c8');

$ ./myprog1.raku | od -An -tx1
 ed a0 bf

✔️ expected result

Slurping a filehandle? Doesn't work

But if I switch from slurping a file path to slurping a filehandle, it doesn't work, even though I set the filehandle's encoding to 'utf8-c8':

myprog2.raku:

#!/usr/local/bin/raku
$*OUT.encoding('utf8-c8');
my $fh = open "testfile", :r, :enc('utf8-c8');
print slurp($fh, enc => 'utf8-c8');
#print $fh.slurp; # I tried this too: same error

$ ./myprog2.raku
Error encoding UTF-8 string: could not encode Unicode Surrogate codepoint 55359 (0xD83F)
  in block <unit> at ./myprog2.raku line 4

Environment

Edit 2022-10-30: I originally used my distro's package (Fedora Linux 36: Rakudo version 2020.07). I just downloaded the latest Rakudo binary release (2022.07-01). Result was the same.

$ /usr/local/bin/raku --version
Welcome to Rakudo™ v2022.07.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2022.07.

$ uname -a
Linux hx90 5.19.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Oct 16 22:50:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: Fedora
Description:    Fedora release 36 (Thirty Six)
Release:        36
Codename:       ThirtySix

2colours · Accepted Answer

I can reproduce the same behavior on Rakudo 2025.02 and Ubuntu 24.04.1 LTS.

Just to clarify, in the line

print slurp($fh, enc => 'utf8-c8');

enc doesn't really do anything here, because the call ultimately turns into $fh.slurp(enc => 'utf8-c8'), and that method doesn't take such a named argument, so it just silently drops it.
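The silent dropping is general Raku behavior rather than something specific to slurp: methods accept and ignore named arguments they do not declare. A tiny illustration, using a made-up class `Demo` with a made-up method `greet`:

```raku
# Demo and greet are hypothetical names, used only for illustration.
class Demo {
    method greet() { 'hello' }
}

# The method declares no named parameters, yet this call does not fail:
# the unknown named argument is simply ignored.
my $result = Demo.greet(enc => 'utf8-c8');
say $result;   # hello
```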

Other than that, I think it is a Rakudo bug, although I don't know enough NQP to fix it. This is the call for the path version and this is the call for the file handle version, if we skip the trivial redispatches that happen at API level. Long story short, it seems that the file handle version tries to be clever in one go while the path version first reads binary and then decodes that.
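Until that is fixed, one possible workaround is to mimic what the path version appears to do: read the handle in binary mode and decode the bytes afterwards. This is a sketch under that assumption, not a confirmed fix; the variable names are illustrative, and the file is recreated here only to keep the example self-contained:

```raku
# Recreate the test file from the question: the three bytes ED A0 BF.
spurt 'testfile', Blob.new(0xED, 0xA0, 0xBF);

# Open in binary mode so that no decoding happens while reading.
my $fh  = open 'testfile', :bin;
my $buf = $fh.slurp(:close);          # raw bytes (a Buf), handle closed afterwards

# Decode as a separate step, the way the path-based slurp seems to.
my $str = $buf.decode('utf8-c8');

# Encode back with utf8-c8 and print the bytes as hex for comparison.
my $roundtrip = $str.encode('utf8-c8').list.map(*.fmt('%02x')).join(' ');
say $roundtrip;
```

If utf8-c8 round-trips as intended, the printed bytes should match the original ed a0 bf.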

I haven't seen a bug report for this particular problem in the rakudo/rakudo repository, so maybe you could open one.

Tags: raku

Robin A. Meade
