
How to handle invalid unicode with simplexml

Tags: php, xml

SimpleXML fails with the following error message:

simplexml_load_file(): fooo.xml:299108: parser error : Char 0xFFFE out of allowed range

From my understanding, the complaint is about an invalid Unicode character. Line 299108 doesn't contain "FFFE", but it does contain the bytes "EFBFBE".

Is there a way to handle this type of error in SimpleXML?

georg asked Oct 14 '11 10:10


1 Answer

I was running into this a lot with incoming user data, and I researched many methods to solve it. There are ways to re-encode the incoming data as UTF-8 without the higher-order (or other) Unicode values that often cause these problems.

However, the problem with the sanitizing solutions is that they change the data, and if you just want to be a middleman, you still want the output to contain these values. The only non-destructive way I could come up with to make SimpleXMLElement reliably not fail is this admittedly double-work solution:
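As an illustration of the sanitizing route, here is a minimal sketch that strips every code point outside the XML 1.0 `Char` range before parsing. The helper name `strip_invalid_xml_chars` is hypothetical and is not taken from any library. The sample input uses the byte sequence `EF BF BE`, which is the UTF-8 encoding of U+FFFE, the character the parser complains about.

```php
<?php
// Hypothetical helper (not a built-in): remove code points that XML 1.0
// does not allow. Allowed are #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
// and #x10000-#x10FFFF.
function strip_invalid_xml_chars(string $utf8): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );

    // preg_replace() returns null on failure, e.g. for malformed UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// "\xEF\xBF\xBE" is U+FFFE encoded as UTF-8.
$dirty = "<a>ok\xEF\xBF\xBEstill ok</a>";
$clean = strip_invalid_xml_chars($dirty);
```

The cleaned string can then be handed to `simplexml_load_string()`, which should no longer trip over the out-of-range character. Note that this approach is destructive: the offending characters are gone from the output.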

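    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument("1.0", "UTF-8");
    $dom->strictErrorChecking = false;
    $dom->validateOnParse = false;
    $dom->recover = true; // try to parse non-well-formed input instead of aborting
    $dom->loadXML($xmlData);
    $xml = simplexml_import_dom($dom);

    // Discard the buffered errors and restore the previous error handling.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);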

The trick is in looking at the properties of DOMDocument in PHP's documentation and noticing the extra properties that let you control parsing behavior. This method has worked without fail for me on all the XML input that used to make SimpleXMLElement fail with character range issues.

My only guess as to why it works is that SimpleXMLElement does strict checking when it parses the input itself, but not when it is initialized from an existing DOMDocument.

This method also allows subsequent asXML() calls without failure.
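If you need this in more than one place, the same steps could be wrapped in a small function that also hands back the errors libxml collected, so they are not silently lost. This is only a sketch: the function name `load_xml_leniently` and the `[element, errors]` return shape are assumptions made for illustration, not part of SimpleXML or DOM.

```php
<?php
// Hypothetical wrapper around the DOMDocument approach.
// Returns [SimpleXMLElement|null, LibXMLError[]].
function load_xml_leniently(string $xmlData): array
{
    // DOMDocument::loadXML() rejects an empty string, so bail out early.
    if ($xmlData === '') {
        return [null, []];
    }

    // Buffer libxml errors and remember the caller's setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->recover = true; // try to parse non-well-formed input
    $loaded = $dom->loadXML($xmlData);

    // Copy the buffered errors before clearing them.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xml = null;
    if ($loaded && $dom->documentElement !== null) {
        $imported = simplexml_import_dom($dom);
        $xml = $imported instanceof SimpleXMLElement ? $imported : null;
    }

    return [$xml, $errors];
}

[$xml, $errors] = load_xml_leniently('<root><item>ok</item></root>');

foreach ($errors as $error) {
    // Each entry is a LibXMLError with line, code and message properties.
    echo $error->line, ': ', trim($error->message), PHP_EOL;
}
```

Restoring the previous `libxml_use_internal_errors()` value, rather than hard-coding `false`, keeps the wrapper from changing error handling for the rest of the application.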

Mike Venzke answered Oct 18 '22 21:10