
Does PHP lack a function for XML-safe entity decoding? Is there no xml_entity_decode?

THE PROBLEM: I need an XML file that is "fully encoded" in UTF-8; that is, with no entities representing symbols: every symbol is encoded as UTF-8, except the only 3 that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And I need a built-in function that does this fast: it should transform entities into real UTF-8 characters (without corrupting my XML).
  PS: it is a "real-world problem" (!); PMC/journals, for example, has 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to convert numeric entities to UTF-8 characters.

THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!) by also transforming the 3 XML-reserved symbols.

Illustrating the problem

Suppose

  $xmlFrag ='<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

Here the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, and the XML-reserved lt must not. After the transformation, the XML text will be,

$xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';

The text "A<B" needs an XML-reserved character, so MUST stay as A&lt;B.


Failed attempts

I tried to use html_entity_decode to solve the problem directly... So I updated my PHP to v5.5 in order to use the ENT_XML1 option,

  $s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
                                                        // as I expected
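
To make the failure concrete, here is a minimal sketch of what that call produces for the sample fragment. The variable name $decoded is chosen only for this illustration. ENT_XML1 decodes the numeric entities as wanted, but it also decodes &lt;, so a raw "<" ends up inside the text content and the result is no longer well-formed XML:

```php
<?php
$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

// ENT_XML1 decodes numeric references, but also the reserved &lt;
$decoded = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The reserved entity is gone and a raw "<" sits in the text:
var_dump(strpos($decoded, 'A<B') !== false);   // bool(true)
var_dump(strpos($decoded, '&lt;') !== false);  // bool(false)
```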

Perhaps another question is, "WHY is there no other option that does what I expected?" -- it is important for many other XML applications (!), not only for me.


I do not need a workaround as an answer... OK, I will show my ugly function; perhaps it helps you understand the problem,

  function xml_entity_decode($s) {
    // An illustration (as a user-defined function) of how the
    // hypothetical PHP built-in function MUST work.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    // Hide the 3 XML-reserved entities behind placeholders.
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);

    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+ (ENT_HTML5)

    // Restore the reserved entities.
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }  // You see? No benchmark is needed:
     //  it is not as fast as calling html_entity_decode directly; an
     //  XML-safe option would be ideal.

PS: corrected after this answer. The ENT_HTML5 flag must be used in order to really convert all named entities.
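
As a small illustration of why the flag matters, the sketch below decodes the same string with the HTML 4.01 entity table and with the HTML5 table. The sample string and the variable names are made up for this example; &check; is used because it is defined in HTML5 but not in HTML 4.01:

```php
<?php
$s = 'ok &check; &eacute; &#x222C;';

// HTML 4.01 entity table (ENT_HTML401 is the default document type):
$html4 = html_entity_decode($s, ENT_HTML401 | ENT_NOQUOTES, 'UTF-8');
// HTML5 entity table:
$html5 = html_entity_decode($s, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');

var_dump(strpos($html4, '&check;') !== false); // bool(true): left undecoded
var_dump(strpos($html5, '&check;') !== false); // bool(false): decoded to U+2713
```

Both calls decode &eacute; and the numeric reference; only the HTML5 table also covers &check;.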

Peter Krauss asked Aug 04 '13 04:08



1 Answer

From time to time, this question attracts a "false answer" (see answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

... So, let's repeat my workaround (which is NOT an answer!) so as not to create more confusion:

The best workaround

Pay attention:

  1. The function xml_entity_decode() below is the best workaround (better than any other).
  2. The function below is not an answer to the present question; it is only a workaround.

  function xml_entity_decode($s) {
    // Illustrates how a (hypothetical) PHP built-in function MUST work.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    // Hide the 3 XML-reserved entities behind placeholders.
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+ (ENT_HTML5)
    // Restore the reserved entities.
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }
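
A caveat about the placeholder approach: it protects only the three named entities. A numeric reference to a reserved character, such as &#60; or &#38;, is still decoded to a raw "<" or "&". The following is a minimal sketch of a variant that decodes one entity at a time and keeps any reference whose decoded form contains a reserved character. The name xml_entity_decode_strict is hypothetical (it is not a PHP built-in), the code assumes PHP 7+ for the type declarations, and the per-match callback will most likely be slower than the str_replace version:

```php
<?php
// Hypothetical variant, not a PHP built-in: decode every entity
// except those that would yield one of the reserved characters & < >.
function xml_entity_decode_strict(string $s): string
{
    // Matches decimal, hexadecimal and named references ending in ";".
    $pattern = '/&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
            // Keep the original reference if the decoded text contains a
            // reserved character (this also covers undecoded references).
            return strpbrk($decoded, '&<>') !== false ? $m[0] : $decoded;
        },
        $s
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $s;
}

$in  = '<p>A&lt;B &#60; &#38; &#160;&#x222C; &amp; &eacute;</p>';
$out = xml_entity_decode_strict($in);
// &lt;, &#60;, &#38; and &amp; stay as they are; the rest become UTF-8 characters.
```

Any candidate like this one should still be timed against the workaround before being preferred.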

To test and demonstrate that you have a better solution, please test it first with this simple benchmark:

  $countBchMk_MAX = 1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for ($countBchMk = 0; $countBchMk < $countBchMk_MAX; $countBchMk++) {

    $A = xml_entity_decode($xml); // 0.0002 seconds per iteration

    /* 0.0014 seconds per iteration:
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  // Prints the average time per iteration.
  echo "\n<h1>END $countBchMk_MAX BENCHMARKS WITH ",
     ($end_time - $start_time)/$countBchMk_MAX,
     " seconds</h1>";
  
answered Oct 19 '22 00:10 (3 revs)