
Why is JSON::XS Not Generating Valid UTF-8?

Tags: json, utf-8, perl

I'm getting some corrupted JSON and I've reduced it down to this test case.

use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;

BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');

done_testing;

And this is the output:

utf8.t .. 
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}

#   Failed test 'JSON: {"value":"Deliver «French Bread»"}'
#   at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}    
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187

So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap. Never could quite get that right with Test::Builder).

Switching to JSON::PP produces the same output.

This is Perl 5.18.1 running on OS X Yosemite.

Ovid asked Dec 06 '14 19:12



2 Answers

is_sane_utf8 doesn't do what you think it does. You're supposed to pass it strings you've already decoded. I'm not sure what the point of it is, but it's not the right tool here. If you want to check whether a string is valid UTF-8, you could use

use Encode qw(decode_utf8);    # loading Encode also defines Encode::FB_CROAK and Encode::LEAVE_SRC

ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
   '$string is valid UTF-8');

To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.

          +--------------------- Start of two byte sequence
          |    +---------------- Not zero (good)     
          |    |     +---------- Continuation byte indicator (good)
          |    |     |
          v    v     v
C2 AB = [110]00010 [10]101011

             00010     101011 = 000 1010 1011 = U+00AB = «
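
The same arithmetic can be checked in a few lines of Perl. This is only a sketch of the two-byte case from the diagram above; the variable names are mine, and it is not a general UTF-8 decoder.

```perl
use strict;
use warnings;

# The two bytes that is_sane_utf8 flagged.
my @bytes = (0xC2, 0xAB);

# A two-byte sequence has the form 110xxxxx 10yyyyyy.
die "not a two-byte lead byte" unless ($bytes[0] & 0xE0) == 0xC0;
die "not a continuation byte"  unless ($bytes[1] & 0xC0) == 0x80;

# Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
my $code_point = (($bytes[0] & 0x1F) << 6) | ($bytes[1] & 0x3F);

# A two-byte sequence must not encode something that fits in one byte.
die "overlong encoding" if $code_point < 0x80;

printf "U+%04X\n", $code_point;    # prints U+00AB
```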

The following shows that JSON::XS produces the same output as Encode.pm:

use utf8;
use 5.18.0;
use JSON::XS;
use Encode;

foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));

    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}

Output (with some spaces added):

Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input:                     44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D

Input: U+65E5.672C.56FD
UTF-8 of input:                     E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
ikegami answered Nov 20 '22 14:11



JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.

Issue 1: Test::utf8

Here are the two main situations when is_sane_utf8 will fail:

  1. You have a miscoded character string, either one that was decoded from a UTF-8 byte string as if it were Latin-1 or one that was decoded from double-encoded UTF-8; or the character string is perfectly fine but happens to look like a potentially "dodgy" miscoding (to use the terminology from its docs).
  2. You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».

The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.

Issue 2: Output Encoding

All of your non-JSON strings are character strings, while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are double encoded. STDERR, however, does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look fine in your warnings: they're already encoded and are passed straight through.
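
To make the double encoding concrete, here is a small sketch using only Encode. It assumes the file is saved as UTF-8 and uses a second encode_utf8 call as a stand-in for what the :encoding(UTF-8) layer does when it is handed bytes that are already encoded; the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

my $chars  = '«';                    # one character, U+00AB
my $bytes  = encode_utf8($chars);    # the bytes encode_json would emit for it
my $double = encode_utf8($bytes);    # each byte is treated as a character and encoded again

printf "chars:  %vX\n", $chars;     # AB
printf "bytes:  %vX\n", $bytes;     # C2.AB
printf "double: %vX\n", $double;    # C3.82.C2.AB
```

The four bytes on the last line are what a UTF-8 terminal renders as two characters instead of a single guillemet.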

Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
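
A minimal sketch of two ways to follow that advice, assuming a UTF-8 source file. I'm using the core JSON::PP module so it runs without extra installs; JSON::XS should behave the same way here, since its utf8 option is likewise off by default on objects created with new. The variable names are mine.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode_utf8);
use JSON::PP;

my $data = { value => 'Deliver «French Bread»' };

# encode_json returns UTF-8 encoded bytes.
my $bytes = encode_json($data);

# Option 1: decode the bytes back into a character string before printing.
my $chars = decode_utf8($bytes);

# Option 2: an encoder without ->utf8 returns a character string directly.
my $chars2 = JSON::PP->new->encode($data);

# Character strings are safe to print through an encoding layer.
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";
print "$chars2\n";
```

Either way, only character strings reach the handle that has the :encoding(UTF-8) layer, so each guillemet is encoded exactly once.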

Nova Patch answered Nov 20 '22 15:11
