
simplexml_load_string() fails with a parser error

I'm trying to load and parse a Google Weather API response (requested in Chinese).

Here is the API call.

// This code fails with the following error
$xml = simplexml_load_file('http://www.google.com/ig/api?weather=11791&hl=zh-CN');

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xB6 0xE0 0xD4 0xC6 in C:\htdocs\weather.php on line 11

Why does loading this response fail?

How do I encode/decode the response so that simplexml loads it properly?

Edit: Here is the code and output.

<?php
$googleData = file_get_contents('http://www.google.com/ig/api?weather=11102&hl=zh-CN');
$xml = simplexml_load_string($googleData);

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xB6 0xE0 0xD4 0xC6 in C:\htdocs\test4.php on line 3
Call Stack
#  Time    Memory  Function                                Location
1  0.0020  314264  {main}( )                               ..\test4.php:0
2  0.1535  317520  simplexml_load_string ( string(1364) )  ..\test4.php:3

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: t_system data="SI"/>

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in C:\htdocs\test4.php on line 3
Call Stack
#  Time    Memory  Function                                Location
1  0.0020  314264  {main}( )                               ..\test4.php:0
2  0.1535  317520  simplexml_load_string ( string(1364) )  ..\test4.php:3

John Himmelman asked May 24 '10 18:05


3 Answers

The problem here is that SimpleXML doesn't look at the HTTP header to determine the document's character encoding. Without an encoding declaration in the XML itself, it simply assumes UTF-8, even though Google's server advertises the response as

Content-Type: text/xml; charset=GB2312

You can write a function that looks at that header using the super-secret magic variable $http_response_header (PHP fills it in after an HTTP request made with file_get_contents()) and converts the response accordingly. Something like this:

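function sxe($url)
{
    // file_get_contents() also populates $http_response_header in this scope
    $xml = file_get_contents($url);
    if ($xml === false)
    {
        return false;
    }

    foreach ($http_response_header as $header)
    {
        // capture only the charset token, ignoring quotes and trailing parameters
        if (preg_match('#^Content-Type:\s*text/xml;\s*charset="?([^\s";]+)#i', $header, $m))
        {
            switch (strtolower($m[1]))
            {
                case 'utf-8':
                    // already UTF-8, nothing to convert
                    break;

                case 'iso-8859-1':
                    // utf8_encode() is deprecated as of PHP 8.2; iconv() works here too
                    $xml = utf8_encode($xml);
                    break;

                default:
                    // any other charset (e.g. GB2312) is converted to UTF-8
                    $xml = iconv($m[1], 'utf-8', $xml);
            }
            break;
        }
    }

    return simplexml_load_string($xml);
}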
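
To see the failure and the fix without hitting the network, here is a minimal sketch. The sample document and its element names are made up for illustration and are not the real feed; the GB2312 bytes are produced locally with iconv() to simulate a response that carries no encoding declaration.

```php
<?php
// Made-up sample document (an assumption, not real feed output).
$utf8 = '<?xml version="1.0"?><weather><condition data="多云"/></weather>';

// Simulate a server that sends GB2312 bytes without declaring the encoding.
$gb = iconv('UTF-8', 'GB2312', $utf8);

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Parsed as UTF-8, so libxml rejects the input and we get false back.
$broken = simplexml_load_string($gb);
$errors = libxml_get_errors();
libxml_clear_errors();

// Converting to UTF-8 first lets the same document parse.
$fixed = simplexml_load_string(iconv('GB2312', 'UTF-8', $gb));

var_dump($broken === false);
echo trim($errors[0]->message), "\n";
echo $fixed->condition['data'], "\n";
```

Note that this only works as-is because the XML declaration names no encoding. If the document declared one, the declaration would have to be adjusted after the conversion as well.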
Josh Davis answered Oct 05 '22 04:10


Update: I can reproduce the problem. Also, Firefox auto-detects the character set as "Chinese Simplified" when I output the raw XML feed. Either the Google feed is serving incorrect data (bytes in a Chinese Simplified encoding instead of UTF-8), or it is serving different data when not fetched in a browser - the Content-Type header in Firefox clearly says utf-8.

Converting the incoming feed from Chinese Simplified (GB18030, this is what Firefox gave me) into UTF-8 works:

 $incoming = file_get_contents('http://www.google.com/ig/api?weather=11791&hl=zh-CN');
 $xml = iconv("GB18030", "utf-8", $incoming);
 $xml = simplexml_load_string($xml);

This neither explains nor fixes the underlying problem yet, though. I don't have time to take a deep look into this right now; maybe somebody else does. To me, it looks like Google is in fact serving incorrect data (which would surprise me. I didn't know they made mistakes like us mortals. :P)
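
Since it is unclear whether the feed always comes back in the same encoding, one option is to convert only when the body is not already valid UTF-8. This is a rough sketch: ensure_utf8() is a hypothetical helper, the GB18030 fallback is an assumption about what the server really sent, and mb_check_encoding() requires the mbstring extension.

```php
<?php
// Hypothetical helper: returns $body unchanged when it is already valid
// UTF-8, otherwise converts it from $fallback (an assumed source encoding).
function ensure_utf8($body, $fallback = 'GB18030')
{
    if (mb_check_encoding($body, 'UTF-8')) {
        return $body;
    }
    return iconv($fallback, 'UTF-8', $body);
}

// Simulated non-UTF-8 response body, built locally for the demo.
$incoming = iconv('UTF-8', 'GB18030', '<condition data="多云"/>');
echo ensure_utf8($incoming), "\n";
```

The validity check is a heuristic. In principle a short run of bytes in another encoding can happen to be valid UTF-8, so the charset from the Content-Type header is the more reliable signal when it is available.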

Pekka answered Oct 05 '22 03:10


Just came across this. This seems to work (I found the function itself on the web and just updated it a bit):

header('Content-Type: text/html; charset=utf-8');

function getWeather() {
    $requestAddress = "http://www.google.com/ig/api?weather=11791&hl=zh-CN";

    // Downloads weather data based on location.
    $xml_str = file_get_contents($requestAddress);

    // Converts the response to UTF-8 first, so the regex below only sees UTF-8.
    $xml_str = iconv("GB18030", "utf-8", $xml_str);

    // Removes the colon from prefixed tag names (<ns:tag> becomes <nstag>).
    $xml_str = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $xml_str);

    // Parses XML
    $xml = new SimpleXMLElement($xml_str);

    // Loops XML
    echo '<div id="weather">';

    foreach ($xml->weather as $item) {
        foreach ($item->forecast_conditions as $new) {
            echo "<div class=\"weatherIcon\">\n";
            echo "<img src='http://www.google.com/" . $new->icon['data'] . "' alt='" . $new->condition['data'] . "'/><br>\n";
            echo "<b>" . $new->day_of_week['data'] . "</b><br>";
            echo "Low: " . $new->low['data'] . " &nbsp;High: " . $new->high['data'] . "<br>";
            echo "\n</div>\n";
        }
    }

    echo '</div>';
}

getWeather();
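
If you would rather keep parsing apart from output, the loop can be sketched as a function that returns plain arrays. forecastRows() is a hypothetical helper, and the sample XML is made up in the shape the loop expects (the root element name is an assumption); it is not real feed output.

```php
<?php
// Hypothetical helper: collects each forecast entry into a plain array.
function forecastRows(SimpleXMLElement $xml)
{
    $rows = array();
    foreach ($xml->weather as $item) {
        foreach ($item->forecast_conditions as $new) {
            $rows[] = array(
                'day'       => (string) $new->day_of_week['data'],
                'low'       => (string) $new->low['data'],
                'high'      => (string) $new->high['data'],
                'condition' => (string) $new->condition['data'],
            );
        }
    }
    return $rows;
}

// Made-up sample, already UTF-8, in the shape the loop expects.
$sample = '<xml_api_reply><weather><forecast_conditions>'
    . '<day_of_week data="周一"/><low data="12"/><high data="20"/>'
    . '<condition data="多云"/>'
    . '</forecast_conditions></weather></xml_api_reply>';

$rows = forecastRows(new SimpleXMLElement($sample));

foreach ($rows as $row) {
    // Escape before echoing, since the values come from a remote feed.
    echo htmlspecialchars($row['day'] . ': ' . $row['condition'], ENT_QUOTES, 'UTF-8'), "\n";
}
```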
AR. answered Oct 05 '22 05:10