Perl's JSON::XS not encoding UTF8 correctly?

Tags: json, cgi, perl

This simple code segment shows an issue I am having with JSON::XS encoding in Perl:

#!/usr/bin/perl
use strict;
use warnings;
use JSON::XS; 
use utf8;
binmode STDOUT, ":encoding(utf8)";

my (%data);

$data{code} = "Gewürztraminer";
print "data{code} = " . $data{code} . "\n";

my $json_text = encode_json \%data;
print $json_text . "\n";

The output this yields is:

johnnyb@boogie:~/Projects/repos > ./jsontest.pl 
data{code} = Gewürztraminer
{"code":"GewÃ¼rztraminer"}

Now if I comment out the binmode line above I get:

johnnyb@boogie:~/Projects/repos > ./jsontest.pl 
data{code} = Gew�rztraminer
{"code":"Gewürztraminer"}

What is happening here? Note that I am trying to fix this behavior in a Perl CGI script in which binmode cannot be used, but I always get the "Ã¼" characters shown above in the returned JSON stream. How do I debug this? What am I missing?

Omortis asked Jul 01 '15 19:07


2 Answers

encode_json (short for JSON::XS->new->utf8->encode) already encodes its output as UTF-8. You then encode it a second time by printing it to STDOUT, to which you've added an :encoding(utf8) layer. Effectively, you are doing encode_utf8(encode_utf8($uncoded_json)).
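
To make the double encoding concrete, here is a minimal sketch that uses the core Encode module instead of JSON::XS (an assumption, so that it runs on a stock Perl). The variable names are illustrative, and the ü is written as \x{FC} so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

my $chars = "Gew\x{FC}rztraminer";   # 14 characters; \x{FC} is ü (U+00FC)
my $once  = encode_utf8($chars);     # 15 bytes: ü becomes C3 BC
my $twice = encode_utf8($once);      # 16 bytes: C3 and BC are each encoded again

printf "characters:    %d\n", length $chars;
printf "encoded once:  %d\n", length $once;
printf "encoded twice: %d\n", length $twice;

# A UTF-8 terminal decodes the output once, so the double-encoded bytes
# show up as the two characters U+00C3 U+00BC ("Ã¼") instead of ü.
my $seen = decode_utf8($twice);
printf "U+%04X ", ord($_) for split //, substr($seen, 3, 2);
print "\n";
```

The extra byte at each step is the signature of one more round of UTF-8 encoding applied to a non-ASCII character.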

Solution 1

use open ':std', ':encoding(utf8)';  # Defaults
binmode STDOUT;                      # Override defaults
print encode_json(\%data);

Solution 2

use open ':std', ':encoding(utf8)';    # Defaults
print JSON::XS->new->encode(\%data);   # Or to_json from JSON.pm

Solution 3

The following works with any encoding on STDOUT by using \u escapes for non-ASCII:

print JSON::XS->new->ascii->encode(\%data);
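
The three solutions differ only in what kind of string the encoder hands back. A small sketch comparing them follows. It uses JSON::PP, which ships with Perl and offers the same utf8, ascii and encode methods, in place of JSON::XS; that substitution is an assumption made for portability, and the variable names are illustrative.

```perl
use strict;
use warnings;
use JSON::PP;

my %data = ( code => "Gew\x{FC}rztraminer" );   # \x{FC} is ü (U+00FC)

# Solution 1 style: UTF-8 octets, meant for a handle without an encoding layer.
my $octets = JSON::PP->new->utf8->encode(\%data);

# Solution 2 style: a character string, meant for a handle that encodes.
my $text = JSON::PP->new->encode(\%data);

# Solution 3 style: pure ASCII, non-ASCII characters become \u escapes.
my $ascii = JSON::PP->new->ascii->encode(\%data);

printf "utf8:  %d bytes\n",      length $octets;
printf "plain: %d characters\n", length $text;
print  "ascii: $ascii\n";
```

The octet form is one unit longer than the character form because ü takes two bytes in UTF-8, while the ASCII form is safe to print through either kind of handle.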

In the comments, you mention it's actually a CGI script.

#!/usr/bin/perl
use strict;
use warnings;

use utf8;                      # Encoding of source code.
use open ':encoding(UTF-8)';   # Default encoding of file handles.
BEGIN {
   binmode STDIN;                       # Usually does nothing on non-Windows.
   binmode STDOUT;                      # Usually does nothing on non-Windows.
   binmode STDERR, ':encoding(UTF-8)';  # For text sent to the log file.
}

use CGI      qw( -utf8 );
use JSON::XS qw( encode_json );

{
   my $cgi = CGI->new();
   my $data = { code => "Gewürztraminer" };
   print $cgi->header('application/json');
   print encode_json($data);
}
ikegami answered Nov 20 '22 03:11


JSON::XS encodes its output into octets: $json_text holds the UTF-8 encoded representation of the string, not a Unicode character string. For more details see perlunicode. In short, the content of $json_text is ready to be sent through an I/O handle in binary mode. If you assign $data{code} from a literal under use utf8;, the scalar contains a string of Unicode characters. (Internally it happens to be stored as UTF-8, but that is an implementation detail you should not rely on. The use utf8; pragma means only that the source code is encoded as UTF-8, nothing else.) If you want to print both scalars through an I/O handle that encodes to UTF-8, you first have to decode $json_text back into a Unicode character string.

use strict;
use warnings;
use JSON::XS; 
use utf8;
binmode STDOUT, ":encoding(utf8)";

my (%data);

$data{code} = "Gewürztraminer";
print "data{code} = " . $data{code} . "\n";

my $json_text = encode_json \%data;
utf8::decode($json_text);
print $json_text . "\n";

Or use it as intended: print the encoded string through an I/O handle in binary mode.

my $json_text = encode_json \%data;
binmode STDOUT;
print $json_text . "\n";
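
Both variants above follow one rule: pair octets with a binary handle, and character strings with an encoding handle. A hedged sketch that checks this with in-memory filehandles, so it runs without a terminal, is shown here. bytes_written is a hypothetical helper, and utf8::encode stands in for the encoding step that encode_json performs.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory handle opened with
# $layer and return the raw bytes that ended up in the buffer.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buffer;
}

my $chars  = "Gew\x{FC}rztraminer";   # character string; \x{FC} is ü
my $octets = $chars;
utf8::encode($octets);                # UTF-8 octets, like encode_json output

my $decoded = $octets;
utf8::decode($decoded);               # back to characters, as in the first variant

my $binary     = bytes_written(':raw',             $octets);   # octets + binary handle
my $encoded    = bytes_written(':encoding(UTF-8)', $chars);    # characters + encoding handle
my $via_decode = bytes_written(':encoding(UTF-8)', $decoded);  # decoded octets + encoding handle
my $doubled    = bytes_written(':encoding(UTF-8)', $octets);   # octets + encoding handle (the bug)

printf "binary=%d encoded=%d via_decode=%d doubled=%d\n",
    length $binary, length $encoded, length $via_decode, length $doubled;
```

The three correct pairings produce identical bytes; only the mismatched one is a byte longer, which is the double encoding from the question.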

Try

print utf8::is_utf8($json_text) ? "UTF8\n" : "OCTETS\n";

to see what is inside.

Hynek -Pichi- Vychodil answered Nov 20 '22 05:11