
xml parse error: 'Invalid character'

Tags: php, xml

I'm using the Google Weather API for a widget.

Everything worked until today, when I hit a problem that I cannot solve. When the API is called with this location:

http://www.google.com/ig/api?weather=dunjkovec,medimurska,croatia&hl=en

I get this error:

XML parse error 9 'Invalid character' at line 1, column 169 (byte index 199)

I suspect that the problem is here: Nedelišće

This is the parsing code:

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);
if (!$ok) {
    $errmsg = sprintf("XML parse error %d '%s' at line %d, column %d (byte index %d)",
    xml_get_error_code($parser),
    xml_error_string(xml_get_error_code($parser)),
    xml_get_current_line_number($parser),
    xml_get_current_column_number($parser),
    xml_get_current_byte_index($parser));
}

$data holds the content of the XML, and $values is empty after the call.
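Since the error message reports a byte index, one way to narrow down an 'Invalid character' error is to hex-dump the bytes around that position. The following is only a sketch: the sample payload is made up for illustration (it is not the real API response), and the exact byte index reported may differ between parser builds.

```php
<?php
// Sample input (an assumption for illustration): "\xE9" is "é" in ISO-8859-1,
// which is not a valid UTF-8 sequence on its own.
$data = "<?xml version=\"1.0\"?><city data=\"caf\xE9\"/>";

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
$ok = xml_parse_into_struct($parser, $data, $values);

if (!$ok) {
    $index = xml_get_current_byte_index($parser);

    // Take a few bytes on either side of the reported position and show
    // them as hex, so non-UTF-8 bytes become visible.
    $start  = max(0, $index - 8);
    $window = substr($data, $start, 16);

    $errmsg = sprintf(
        "XML parse error %d '%s' near byte %d, bytes: %s",
        xml_get_error_code($parser),
        xml_error_string(xml_get_error_code($parser)),
        $index,
        bin2hex($window)
    );
    echo $errmsg, "\n";
}
```

A lone byte of 0x80 or above in the dump (such as e9 here) suggests the data is not UTF-8, even though the parser was created with 'UTF-8'.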

Can someone help me? Thank you very much!

EDIT----------------------------------

After reading Hussain's post I discovered that the problem is in the way the file gets retrieved.

I tried file_get_contents and cURL. Both return:

That is the line that causes the problem. Or so I thought! I tried html_entity_decode($data,ENT_NOQUOTES,'UTF-8') and it didn't work. Then I made a discovery: I can't echo the contents of the XML; I can only print_r them and see the result in the HTML source! It works with any other location in the world; only this one creates problems... I wanna cry :-(

EDIT 2--------------------------------

For anybody who cares: I fixed the problem by running these lines of code after retrieving the XML file from the API:

$data = mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data, 'UTF-8, ISO-8859-1', true));
$data = html_entity_decode($data,ENT_NOQUOTES,'UTF-8'); 

Then parse the XML; it works like a charm. I accepted Hussain's answer because it got me on the right track.
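The re-encoding step of this fix can be sketched as a small standalone script. The helper name toUtf8 and the sample payload are hypothetical (neither comes from the original code), and the sketch assumes the mbstring and xml extensions are loaded. It leaves out the html_entity_decode call on purpose: that function also decodes &amp;amp; and &amp;lt;, which can turn otherwise valid XML into ill-formed XML, so whether it is needed depends on the actual response.

```php
<?php
// Hypothetical helper (not from the original post): re-encode a response
// body to UTF-8 before handing it to the XML parser.
function toUtf8(string $data): string
{
    // Strict detection: the first listed encoding in which the bytes are
    // valid is returned.
    $from = mb_detect_encoding($data, ['UTF-8', 'ISO-8859-1'], true);
    if ($from === false) {
        $from = 'ISO-8859-1'; // assumed fallback, adjust for your data
    }
    return mb_convert_encoding($data, 'UTF-8', $from);
}

// Sample payload (made up): "\xE9" is "é" in ISO-8859-1 and invalid as UTF-8.
$raw  = "<?xml version=\"1.0\"?><city data=\"caf\xE9\"/>";
$data = toUtf8($raw);

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);

// On success the attribute value is available as UTF-8 text.
if ($ok) {
    echo $values[0]['attributes']['data'], "\n";
}
```

Input that is already valid UTF-8 is detected as UTF-8 and passes through unchanged, so the helper is safe to apply to every response.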

0plus1 asked Jan 04 '11 09:01



1 Answer

After reading your problem, I tried the same thing on my machine. Here is what I did:

1. Downloaded the XML file from the URL you posted to my local machine.
2. Used your XML parsing script to build the structure from the XML.

Surprisingly, it worked perfectly on my machine, even though the XML contains the word Nedelišće. So I think the problem is in the way the XML file is being read.

It would be easier to debug if you could tell me how you are reading the XML from the Google API. Are you using cURL?

EDIT -----------------------------------------------

Hi 0plus1,

I have prepared a helper function that converts those special characters to HTML entities so that the XML can be parsed.

I am pasting the entire code here. Use the following script:

function utf8tohtml($utf8, $encodeTags)
{
    $result = '';
    for ($i = 0; $i < strlen($utf8); $i++)
    {
        $char = $utf8[$i];
        $ascii = ord($char);
        if ($ascii < 128)
        {
            // one-byte character
            $result .= ($encodeTags) ? htmlentities($char , ENT_QUOTES, 'UTF-8') : $char;
        } else if ($ascii < 192)
        {
            // non-utf8 character or not a start byte
        } else if ($ascii < 224)
        {
            // two-byte character
            $result .= htmlentities(substr($utf8, $i, 2), ENT_QUOTES, 'UTF-8');
            $i++;
        } else if ($ascii < 240)
        {
            // three-byte character
            $ascii1 = ord($utf8[$i+1]);
            $ascii2 = ord($utf8[$i+2]);
            $unicode = (15 & $ascii) * 4096 +
                (63 & $ascii1) * 64 +
                (63 & $ascii2);
            $result .= "&#$unicode;";
            $i += 2;
        } else if ($ascii < 248)
        {
            // four-byte character
            $ascii1 = ord($utf8[$i+1]);
            $ascii2 = ord($utf8[$i+2]);
            $ascii3 = ord($utf8[$i+3]);
            $unicode = (15 & $ascii) * 262144 +
                (63 & $ascii1) * 4096 +
                (63 & $ascii2) * 64 +
                (63 & $ascii3);
            $result .= "&#$unicode;";
            $i += 3;
        }
    }
    return $result;
}


$curlHandle = curl_init();
$serviceUrl = "http://www.google.com/ig/api?weather=dunjkovec,medimurska,croatia&hl=en";
// setup the basic options for the curl
curl_setopt($curlHandle , CURLOPT_URL, $serviceUrl);
curl_setopt($curlHandle , CURLOPT_HEADER , 0);
curl_setopt($curlHandle , CURLOPT_HTTPHEADER , array("Cache-Control: no-cache","Content-type: application/x-www-form-urlencoded;charset=UTF-8"));
curl_setopt($curlHandle , CURLOPT_FOLLOWLOCATION , true);
curl_setopt($curlHandle , CURLOPT_RETURNTRANSFER , true);
curl_setopt($curlHandle , CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
$data = curl_exec($curlHandle);
// echo $data;
$data = utf8tohtml($data , false);
echo $data;

$parser = xml_parser_create("UTF-8");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);
if (!$ok) {
    $errmsg = sprintf("XML parse error %d '%s' at line %d, column %d (byte index %d)",
    xml_get_error_code($parser),
    xml_error_string(xml_get_error_code($parser)),
    xml_get_current_line_number($parser),
    xml_get_current_column_number($parser),
    xml_get_current_byte_index($parser));
}
echo "<pre>";
print_r($values);
echo "</pre>";
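As a worked check of the bit arithmetic used in the three-byte branch above, the sketch below decodes the euro sign by hand and then compares the idea with PHP's built-in mb_encode_numericentity, which produces decimal numeric references for every character in a given range. The sample strings are assumptions chosen for illustration, and the built-in is shown as a possible alternative, not as what the script above does (its two-byte branch uses htmlentities).

```php
<?php
// The euro sign (U+20AC) is the three-byte UTF-8 sequence E2 82 AC.
$bytes = "\xE2\x82\xAC";

// Same arithmetic as the three-byte branch: 4 payload bits from the lead
// byte, then 6 payload bits from each continuation byte.
$unicode = (15 & ord($bytes[0])) * 4096
         + (63 & ord($bytes[1])) * 64
         + (63 & ord($bytes[2]));
echo $unicode, "\n"; // 8364, i.e. 0x20AC

// Built-in alternative: convert every code point from U+0080 upward into a
// decimal numeric character reference. The map is
// [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$name    = "Nedeli\u{0161}\u{0107}e"; // "Nedelišće" written with escapes
$encoded = mb_encode_numericentity($name, $convmap, 'UTF-8');
echo $encoded, "\n"; // Nedeli&#353;&#263;e
```

Numeric references such as &amp;#353; are understood by any XML parser, whereas named HTML entities other than the five predefined XML ones are not declared in a plain XML document.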

Hope this will help.

Thanks!

Hussain.

eHussain answered Sep 23 '22 12:09
