Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP, MySQL and XML = garbled HTML output

I have a field in MySQL of type text, using the following collation: utf8_general_ci.

This XML field is populated using a variable built using DOMDocument:

function ed_audit_node($dom, $field, $new, $old){

    //create audit_detail node
    $ad = $dom->createElement('audit_detail');

    $fn = $dom->createElement('fieldname');
    $fn->appendChild($dom->createTextNode($field));
    $ad->appendChild($fn);

    $ov = $dom->createElement('old_value');
    $ov->appendChild($dom->createTextNode($old));
    $ad->appendChild($ov);

    $nv = $dom->createElement('new_value');
    $nv->appendChild($dom->createTextNode($new));
    $ad->appendChild($nv);

    //append to document
    return $ad;
}

Here's how I save to the db ( $xml comes from $dom->saveXML() ):

function ed_audit_insert($ed, $xml){
    global $visitor;

    $sql = <<<EOF
    INSERT INTO ed.audit
    (employee_id, audit_date, audit_action, audit_data, user_id) 
    VALUES (
        {$ed[emp][employee_id]}, 
        now(), 
        '{$ed[audit_action]}', 
        '{$xml}', 
        {$visitor[user_id]}
    );      
EOF;
    $req = mysql_query($sql,$ed['db']) or die(db_query_error($sql,mysql_error(),__FUNCTION__));
//snip  
}

See an older, parallel, slightly related thread on how I’m creating this XML: Another PHP XML parsing error: "Input is not proper UTF-8, indicate encoding!"

What works: - querying the database, selecting the field and outputting it using jQuery (.ajax()) and populating a textarea. Firebug and the textarea match what's in the database (confirmed with Toad).

What doesn't work: - outputting the text from the database into an HTML page. This HTML page has the content-type ISO-8859-1, which I cannot change.

Here’s the code that outputs that to the screen:

$xmlData = simplexml_load_string($d['audit_data']);

foreach ($xmlData->audit_detail as $a){
    echo "<p> straight from db = ".$a->new_value."</p>";
    echo "<p> utf8_decode() = ".utf8_decode($a->new_value)."</p>";
} 

I’ve also used a charset changer extension for Firefox: tried ISO-8859-1, UTF-8 and 1252 without success.

If it was UTF-8, shouldn’t I be seeing diamonds with question marks inside (since it's content-type = ISO-8859-1)? If it’s not UTF-8, what is it?

Edit #1

Here's snapshot of other tests that I have made:

$xmlData = simplexml_load_string($d['audit_data']);
foreach ($xmlData->audit_detail as $a){
    echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>";
    echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>";
    echo "<hr/>";
    echo "<p> straight from db = <pre>".$a->new_value."</pre></p>";
    echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>";
    echo "<hr/>";
    $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value);
    $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8');
    echo "<p> iconv() = ".$iso88591_2."</p>";
    echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>";
}

Edit #2

I added the FF proprietary tag, xmp.

Code:

$xmlData = simplexml_load_string($d['audit_data']);

foreach ($xmlData->audit_detail as $a){
    echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>";
    echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>";
    echo "<hr/>";
    echo "<p> straight from db = <pre>".$a->new_value."</pre></p>";
    echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>";
    echo "<hr/>";
    $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value);
    $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8');
    echo "<p> iconv() = ".$iso88591_2."</p>";
    echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>";
    echo "<hr/>";
    echo "<p>straight from db, using &lt;xmp&gt;  = <xmp>".$a->new_value."</xmp></p>";
    echo "<p>utf8_decode(), using &lt;xmp&gt; = <xmp>".utf8_decode($a->new_value)."</xmp></p>";

}

Here are some meta tags from the page:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="dc.language" scheme="ISO639-2/T" content="eng" />

IMO, the last meta tag has no bearing.

Edit #3

Source code:

<p>encoding is, straight from db, using mb_detect_encoding: UTF-8</p><p>encoding is, with utf8_decode, using mb_detect_encoding: ASCII</p><hr/><p> straight from db = <pre>Ro马eç ³é ¥n franê¡©s</pre></p><p> utf8_decode() = <pre>Ro?e??n fran?s</pre></p><hr/><p> iconv() = Ro</p><p> mb_convert_encoding() = Ro?e??n fran?s</p><hr/><p>straight from db, using &lt;xmp&gt;  = <xmp>Ro马eç ³é ¥n franê¡©s</xmp></p><p>utf8_decode(), using &lt;xmp&gt; = <xmp>Ro?e??n fran?s</xmp></p>

Edit #4

Here is the SQL statement going in to the db:

INSERT INTO ed.audit
    (employee_id, audit_date, audit_action, audit_data, user_id) 
    VALUES (
        75, 
        now(), 
        'u', 
        '<?xml version="1.0"?>
<audit><audit_detail><fieldname>role_fra</fieldname><old_value>aRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s</old_value><new_value>bRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s</new_value></audit_detail></audit>
', 
        333
    );

! Note, the text from this XML doesn't necessarily match the screenshots provided above.

Edit #5

Here's my new function that wraps the CDATA tag around my values for the old_value and new_value nodes:

function ed_audit_node($dom, $field, $new, $old){

    //create audit_detail node
    $ad = $dom->createElement('audit_detail');

    $fn = $dom->createElement('fieldname');
    $fn->appendChild($dom->createTextNode($field));
    $ad->appendChild($fn);

    $ov = $dom->createElement('old_value');

    $ov->appendChild($dom->createCDATASection($old));
    $ad->appendChild($ov);

    $nv = $dom->createElement('new_value');
    $nv->appendChild($dom->createCDATASection($new));
    $ad->appendChild($nv);

    //append to document
    return $ad;
}

I also added the encoding to the XML document:

$dom = new DomDocument('1.0', 'UTF-8');

Here's my new simpleXML call:

$xmlData = simplexml_load_string($d['audit_data'], "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA);

I see the CDATA tags in Toad as well. However, I'm still getting an error:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 2: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0xE9 0x6C 0x65 in <snip>

Edit #6

I just noticed that the jQuery call returns the proper accented characters in the CDATA.

like image 854
TechFanDan Avatar asked Jan 13 '11 17:01

TechFanDan


1 Answers

Technically your string is in UTF8, but the HTML encoded characters (when rendered by the browser) are not in UTF8. So &#xa869; is a valid UTF8 String, but the character that is rendered onto the screen from the web browser is not valid UTF8.

I would also wrap your echo to the screen (last 2 lines in your example) like this:

echo "<p>straight from db = <xmp>".$a->new_value."</xmp></p>";
echo "<p>utf8_decode() = <xmp>".utf8_decode($a->new_value)."</xmp></p>";

This will clearly display the point i am making above.

Edit:

The problem actually is an uncontrollable undocumented "Feature" in simplexml_load_string() for PHP. It will automatically convert all characters from their XML entity form strait into their actual char form. The only way to circumvent this is to use simplexml_load_string() like this:

 $data = simplexml_load_string(
      '<?xml version="1.0" encoding="utf-8"?> 
           <audit>
                <audit_detail>
                     <fieldname>role_fra</fieldname>
                     <old_value><![CDATA[aRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s]]></old_value>
                     <new_value><![CDATA[bRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s]]></new_value>
                </audit_detail>
           </audit>', 
      "SimpleXMLElement", 
      LIBXML_NOENT | LIBXML_NOCDATA
 );
 print "<PRE>";
 print_r($data);
 exit;

You must wrap your elements in <![CDATA[]]> tags and then pass the LIBXML_NOCDATA option to the xml parser. This will force the things in <![CDATA[]]> tags to be cast as String type and PHP can properly handle that outside of a SimpleXMLObject.

like image 75
Geoffrey Wagner Avatar answered Sep 29 '22 16:09

Geoffrey Wagner