
perl: convert a string to utf-8 for json decode

Tags: json, utf-8, perl

I'm crawling a website and collecting information from its JSON. The results are saved in a hash. But some of the pages give me a "malformed UTF-8 character in JSON string" error. I noticed that the last letter in "café" triggers the error. I think it is caused by a mix of character encodings. So now I'm looking for a way to convert every kind of character to UTF-8 (I hope such a catch-all exists). I tried utf8::all, but it didn't work (maybe I didn't use it correctly). I'm a noob. Please help, thanks.


UPDATE

After reading the article "Know the difference between character strings and UTF-8 strings" by brian d foy, I solved the problem with this code:

use utf8;
use Encode qw(encode_utf8);
use JSON;


my $json_data = qq( { "cat" : "Büster" } );
$json_data = encode_utf8( $json_data );

my $perl_hash = decode_json( $json_data );

Hope this helps someone else.

Ivan Wang asked May 22 '12 18:05



1 Answer

decode_json expects the JSON to have been encoded using UTF-8.

While your source file is encoded using UTF-8, you have Perl decode it by using use utf8; (as you should). This means your string contains Unicode characters, not the UTF-8 bytes that represent those characters.
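The difference between characters and UTF-8 bytes can be seen by comparing lengths. This is only a minimal sketch using the core Encode module; it writes the "ü" as the escape \x{FC} so that it does not depend on how the file is saved, which gives the same string that "Büster" yields under use utf8;. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# A character string: six characters, the second one being U+00FC ("ü").
my $chars = "B\x{FC}ster";

# The same text as UTF-8 bytes: "ü" is represented by two bytes (0xC3 0xBC).
my $bytes = encode_utf8($chars);

printf "characters: %d\n", length $chars;   # 6
printf "bytes:      %d\n", length $bytes;   # 7
```

decode_json wants the second form, which is why a string that Perl has already decoded has to be encoded again first.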

As you've shown, you could encode the string before passing it to decode_json.

use utf8;
use Encode qw( encode_utf8 );
use JSON   qw( decode_json );

# Because of "use utf8", the literal holds decoded characters,
# so it is encoded to UTF-8 bytes before being handed to the decoder.
my $data_json = qq( { "cat" : "Büster" } );
my $data = JSON->new->utf8(1)->decode(encode_utf8($data_json));
# -or-
# my $data = JSON->new->utf8->decode(encode_utf8($data_json));
# -or-
# my $data = decode_json(encode_utf8($data_json));

But you could simply tell JSON that the string is already decoded.

use utf8;
use JSON qw( from_json );

# With the utf8 option off, the decoder takes the string as characters.
my $data_json = qq( { "cat" : "Büster" } );
my $data = JSON->new->utf8(0)->decode($data_json);
# -or-
# my $data = JSON->new->decode($data_json);
# -or-
# my $data = from_json($data_json);
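To check that both routes end in the same place, the sketch below decodes one document twice: once as UTF-8 bytes with the utf8 option on, once as characters with it off. It is an illustration under assumptions, not part of the original answer: it uses the core JSON::PP module as a stand-in for JSON (the new, utf8 and decode methods used here work the same way), writes "ü" as \x{FC} to avoid depending on the file's encoding, and uses made-up variable names. The last step repeats the mistake from the question and only reports what happened.

```perl
use strict;
use warnings;
use Encode   qw( encode_utf8 );
use JSON::PP ();   # assumption: core stand-in for the JSON module

my $json_chars = qq({ "cat" : "B\x{FC}ster" });   # decoded characters
my $json_bytes = encode_utf8($json_chars);        # UTF-8 bytes, e.g. a raw HTTP body

# utf8 on: the input is UTF-8 bytes. utf8 off: the input is characters.
my $from_bytes = JSON::PP->new->utf8->decode($json_bytes);
my $from_chars = JSON::PP->new->decode($json_chars);

print "both routes give the same value\n"
    if $from_bytes->{cat} eq $from_chars->{cat};

# The mismatch from the question: characters given to a decoder expecting bytes.
my $ok = eval { JSON::PP->new->utf8->decode($json_chars); 1 };
print $ok ? "mismatch went unnoticed\n" : "mismatch rejected: $@";
```

For a crawler this suggests passing the response bytes straight to decode_json, and calling encode_utf8 only when the text has already been decoded somewhere along the way.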
ikegami answered Oct 10 '22 10:10