
Encoding troubles - one format to another

I have a scraper that collects data from a source I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, such as

\u00e4

for a lowercase 'a' with an umlaut (without the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.

Is there any realistic way to convert the Unicode escapes in the source into proper characters that doesn't involve manually working through every escape sequence and replacing it during the scrape?

*Here is a sample of the JSON that it spits out:

({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})
hollsk asked Jul 16 '10 20:07


2 Answers

Since \u00e4 is the JavaScript (and JSON) escape sequence for a Unicode character, one option is to use PHP's json_decode() function to decode it into a PHP string.

The valid JSON string would be:

$json = '"\u00e4"';

And this:

header('Content-type: text/html; charset=UTF-8');
$php = json_decode($json);
var_dump($php);

would give you the right output:

string 'ä' (length=2)

(It's one character, but two bytes long when encoded as UTF-8)
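
To make the "one character, two bytes" point concrete, here is a minimal sketch that inspects the decoded string. The variable names are only illustrative, and it assumes the PCRE extension that ships with PHP by default.

```php
<?php
// Decode the JSON string "\u00e4"; json_decode() returns it as UTF-8.
$decoded = json_decode('"\u00e4"');

// strlen() counts bytes, not characters.
$byteCount = strlen($decoded);

// With the /u modifier, "." matches one UTF-8 character, so the number of
// matches is the number of characters.
$charCount = preg_match_all('/./us', $decoded);

// The raw bytes, shown as hexadecimal.
$hexBytes = bin2hex($decoded);

echo "bytes: $byteCount, characters: $charCount, hex: $hexBytes\n";
// prints "bytes: 2, characters: 1, hex: c3a4"
```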


Still, it feels a bit hackish ^^
And it might not work too well, depending on the kind of string you get as input...
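
If the input is plain text with \uXXXX escapes rather than valid JSON, one possible workaround is to decode only the escape sequences and leave everything else alone. This is just a sketch: unescapeUnicode() is a made-up name, and it assumes the escapes always have exactly four hex digits.

```php
<?php
// Hypothetical helper: replace \uXXXX escapes in arbitrary text with UTF-8.
function unescapeUnicode(string $text): string
{
    // Match runs of consecutive escapes so that surrogate pairs
    // (two escapes forming one character) are decoded together.
    $result = preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Wrap the run in quotes to make it a valid JSON string.
            $decoded = json_decode('"' . $m[0] . '"');
            // Leave the run untouched if it cannot be decoded
            // (for example, a lone surrogate).
            return is_string($decoded) ? $decoded : $m[0];
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error.
    return $result ?? $text;
}

echo unescapeUnicode('Latest post by D\u00e4vid'), "\n";
```

Quotes, slashes and other characters in the surrounding text are never passed to json_decode(), so they cannot break the decoding.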

[Edit] I've just seen your comment, where you seem to indicate that you get JSON as input. If so, json_decode() might really be the right tool for the job ;-)
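
For a whole JSON payload, a sketch could look like the following. The sample in the question is wrapped in parentheses and appears cut off, so this uses a made-up, well-formed payload of the same shape; decodeScrapedJson() is a hypothetical name, and the assumption that the wrapper is a single pair of parentheses may not hold for the real source.

```php
<?php
// Hypothetical helper: decode a scraped payload that may be wrapped in ( ... ).
function decodeScrapedJson(string $raw): array
{
    $trimmed = trim($raw);

    // Assumption: the wrapper, if present, is exactly one pair of parentheses.
    if (strlen($trimmed) >= 2 && $trimmed[0] === '(' && substr($trimmed, -1) === ')') {
        $trimmed = substr($trimmed, 1, -1);
    }

    // true = return associative arrays; \uXXXX escapes become UTF-8 characters.
    $data = json_decode($trimmed, true);
    if (!is_array($data)) {
        throw new InvalidArgumentException(
            'Payload is not a JSON object or array: ' . json_last_error_msg()
        );
    }

    return $data;
}

// Made-up payload in the shape of the sample from the question.
$raw  = '({"content":{"pagelet_tab_content":"<span>D\u00e4vid<\/span>"}})';
$data = decodeScrapedJson($raw);
$html = $data['content']['pagelet_tab_content'];

echo $html, "\n";
```

Both the \u00e4 escape and the \/ escape are resolved by json_decode(), so $html is ready to be sent in a page served as UTF-8.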

Pascal MARTIN answered Sep 28 '22 15:09


The accepted answer won't work if you call the JSON functions somewhere in the middle of page execution (e.g. in a plugin for some CMS) or if you cannot set the header information. Of course, the page header should always be set correctly.

You can pass additional parameters to json_encode() to "force" it to output UTF-8 characters instead of \uXXXX escapes; json_decode() already returns UTF-8 strings. I'm building a simple class for this, using static methods to get my results.

The key to this is the JSON_UNESCAPED_UNICODE flag. Use it like this:

Data Class

/*
    Data Class
    * * * * * * *
    Encode and Decode Your String / Object / Array with utf-8 force.
*/
class Data {

    // Encode
    // @param $a  Value (array, object or string) to encode as JSON
    public static function encode($a=[]){
        // Keep non-ASCII characters as UTF-8 instead of \uXXXX escapes
        $json = json_encode($a, JSON_UNESCAPED_UNICODE);
        return $json;
    }

    // Decode
    // @param $a  JSON String
    // @param $t  Type of return (false = Object, true = associative Array)
    public static function decode($a='', $t=false){
        // JSON_UNESCAPED_UNICODE only affects json_encode();
        // json_decode() always turns \uXXXX escapes into UTF-8
        $obj = json_decode($a, $t, 512);
        return $obj;
    }
}

Usage

// Get your JSON String
$some_json_string = file_get_contents(YOUR_URL);

// Decode as you wish (second argument: true = associative array, default = object)
$json_as_object   = Data::decode($some_json_string);
$json_as_array    = Data::decode($some_json_string, true);

// Debug / use your Content 
echo "<pre>";
print_r($json_as_array);
print_r($json_as_object);
echo "</pre>";
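
To see what the flag actually changes, here is a small sketch comparing json_encode() with and without JSON_UNESCAPED_UNICODE. The variable names are only for illustration, and the non-ASCII character is written as a byte escape so the example does not depend on the encoding of the source file.

```php
<?php
// "\xC3\xA4" is the UTF-8 byte sequence for U+00E4 (ä).
$name = "D\xC3\xA4vid";

// Default: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($name);

// With the flag: the UTF-8 bytes are kept as they are.
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";    // prints "D\u00e4vid" (with the quotes)
echo $unescaped, "\n";  // prints "Dävid" (with the quotes)

// Both forms are valid JSON and decode to the same PHP string.
$sameAfterDecode = json_decode($escaped) === json_decode($unescaped);
```

So the flag matters when you produce JSON; when you consume it, either form decodes to the same UTF-8 string.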
Gkiokan answered Sep 28 '22 16:09