
Blob.decode with replacement does not seem to work

Tags:

encoding

raku

This code:

my $þor-blob = Blob.new("þor".ords);
$þor-blob.decode( "ascii", :replacement("0"), :strict(False) ).say

Fails with:

Will not decode invalid ASCII (code point > 127 found)␤

And this one:

my $euro = Blob.new("3€".ords);
$euro.decode( "latin1", :replacement("euro") ).say

Simply does not seem to work, replacing € by ¬.

It's true that those methods are not tested, but is the syntax right?

jjmerelo asked Mar 26 '19 08:03


1 Answer

TL;DR:

  • Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.

  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot. [1]

  • Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.

  • The "replacer" code path passes the $replacement and $strict arguments on to a code path in nqp, which in turn passes them on to a code path in the backend that handles replacements.

  • On the MoarVM backend, the replacement and strict arguments are passed on to the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings. [2]

Following the relevant code path

Your code calls this code in Buf.pm6:

multi method decode(Blob:D: $encoding,
                    Str    :$replacement!,
                    Bool:D :$strict = False) {
    nqp::p6box_s(
      nqp::decoderepconf(
        self,
        Rakudo::Internals.NORMALIZE_ENCODING($encoding),
        $replacement.defined ?? $replacement !! nqp::null_s(),
        $strict ?? 0 !! 1))
}
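
How a required named parameter selects a different multi candidate can be illustrated in isolation. This is a minimal sketch, not Rakudo's actual code: the Demo class and the strings it returns are hypothetical, and only the shape of the `:$replacement!` signature is taken from the snippet above.

```raku
# Hypothetical stand-in for Blob; only the signature shape mirrors Buf.pm6.
class Demo {
    # Candidate with a *required* named parameter (note the trailing `!`).
    # It is declared first so that it is tried before the fallback.
    multi method decode(Str :$replacement!) { "replacer path" }

    # Fallback candidate, chosen when :replacement is not passed.
    multi method decode() { "default path" }
}

say Demo.decode(:replacement("0"));   # replacer path
say Demo.decode();                    # default path
```

Passing :replacement at all is what routes the call to the first candidate, regardless of what the encoding's decoder later does with the value.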

The nqp::decoderepconf function directly maps to a corresponding function in the backend.

On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.

This in turn calls MVM_string_decode_config in the same file.

From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:

Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.

For now windows-1252 and windows-1251 are the only ones this makes a difference on.

Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.

Also, to be clear: if one specifies the $replacement argument in P6, then the $strict argument ends up being ignored (and $strict = True assumed) when decoding any encoding other than the windows or shiftjis encodings. [2]

What happens with ascii and latin1 in particular

The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.

So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.

say "þor".ords; # (254 111 114)
say "3€".ords;  # (51 8364)

The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
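
Since the ascii decoder does not honor the replacement, the substitution can be done by hand. The following is only a sketch under that assumption: decode-ascii-lossy is a hypothetical helper, not part of Rakudo, and it simply treats every byte above 127 as invalid.

```raku
# Hypothetical helper, not part of Rakudo: decode a blob as ASCII,
# substituting $replacement for every byte that is not valid ASCII.
sub decode-ascii-lossy(Blob:D $blob, Str:D $replacement --> Str:D) {
    $blob.list.map(-> $byte {
        $byte < 128 ?? $byte.chr !! $replacement
    }).join
}

my $þor-blob = Blob.new("þor".ords);       # bytes 254, 111, 114
say decode-ascii-lossy($þor-blob, "0");    # 0or
```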

The second replaces € with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte, which for € (8364, or 0x20AC) is 0xAC, the same as ¬ (in both latin1 and Unicode). [3]
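
The truncation can be checked with plain integer arithmetic. This is a small sketch of the low-byte computation only; the variable names are illustrative, and the last line relies on the 8 bit storage behavior described above.

```raku
my $euro-code = "€".ord;
say $euro-code;                      # 8364
say $euro-code.base(16);             # 20AC

# Keep only the low 8 bits, as an 8 bit element would.
my $low-byte = $euro-code +& 0xFF;
say $low-byte;                       # 172
say $low-byte.chr;                   # ¬

# Per the explanation above, the blob itself should hold 51 and 172.
say Blob.new("3€".ords).list;
```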

But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C, so it's clear to me that the MVM_string_latin1_decode function in MoarVM does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.

Footnotes

[1] Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot.

[2] It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.

[3] See timotimo++'s comment below.

raiph answered Nov 15 '22 11:11