
Is there a way to keep entities intact while parsing html with DomDocument?

I have this function to ensure that every img tag has an absolute URL:

function absoluteSrc($html, $encoding = 'utf-8')
{
    $dom = new DOMDocument();
    // Workaround to use proper encoding
    $prehtml  = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset={$encoding}\"></head><body>";
    $posthtml = "</body></html>";

    if($dom->loadHTML( $prehtml . trim($html) . $posthtml)){
        foreach($dom->getElementsByTagName('img') as $img){
            if($img instanceof DOMElement){
                $src = $img->getAttribute('src');
                if( strpos($src, 'http://') !== 0 ){
                    $img->setAttribute('src', 'http://my.server/' . $src);
                }
            }
        }

        $html = $dom->saveHTML();

        // Remove remains of workaround / DomDocument additions
        $cut_start  = strpos($html, '<body>') + 6;
        $cut_length = -1 * (1+strlen($posthtml));
        $html = substr($html, $cut_start, $cut_length);
    }
    return $html;
}

It works fine, but it returns the entities decoded into Unicode characters:

$html = <<< EOHTML
<p><img src="images/lorem.jpg" alt="lorem" align="left">
Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="others/ipsum.png" alt="ipsum" align="right"></p>

<center>&copy; Dr&nbsp;Jekyll &#38; Mr&nbsp;Hyde</center>
EOHTML;

echo absoluteSrc($html);

Outputs:

<p><img src="http://my.server/images/lorem.jpg" alt="lorem" align="left">
Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="http://my.server/others/ipsum.png" alt="ipsum" align="right"></p>

<center>© Dr Jekyll &amp; Mr Hyde</center>

As you can see in the last line:

  • &copy; is translated to © (U+00A9),
  • &nbsp; to non-breaking space (U+00A0),
  • &#38; to &amp;

I would like them to remain the same as in the input string.

dev-null-dweller asked Sep 16 '10 21:09


2 Answers

The following code seems to work. It swaps each named entity for a placeholder before parsing and swaps it back afterwards:

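   $dom = new DOMDocument('1.0', 'UTF-8');
   // Set before loading so that it applies when the markup is parsed
   $dom->preserveWhiteSpace = true;
   // Swap named entities for placeholders so the parser does not decode them
   $dom->loadHTML($this->htmlentities2stringcode(rawurldecode($content)));

   // getInnerHTML() is a helper of the same class that is not shown here
   $innerHTML = str_replace("<html></html><html><body>", "",
       str_replace("</body></html>", "",
           str_replace("+", "%2B", str_replace("<p></p>", "", $this->getInnerHTML($dom)))));

   // Swap the placeholders back into the original entities
   return $this->stringcode2htmlentities($innerHTML);
}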
// ----------------------------------------------------------
function htmlentities2stringcode($string) {
    // Convert HTML entities such as &Alpha; into their placeholder form (^Alpha^)
    $from = array_keys($this->getHTMLEntityStringCodeArray());
    $to   = array_values($this->getHTMLEntityStringCodeArray());
    return str_replace($from, $to, $string);
}
// ----------------------------------------------------------
function stringcode2htmlentities($string) {
    // Convert placeholders such as ^Alpha^ back into the original HTML entity (&Alpha;)
    $from = array_values($this->getHTMLEntityStringCodeArray());
    $to   = array_keys($this->getHTMLEntityStringCodeArray());
    return str_replace($from, $to, $string);
}
  // -------------------------------------------------------------
  function getHTMLEntityStringCodeArray() {

      return array('&Alpha;'=>'^Alpha^', 
                                    '&Beta;'=>'^Beta^', 
                                    '&Chi;'=>'^Chi^', 
                                    '&Dagger;'=>'^Dagger^', 
                                    '&Delta;'=>'^Delta^', 
                                    '&Epsilon;'=>'^Epsilon^', 
                                    '&Eta;'=>'^Eta^', 
                                    '&Gamma;'=>'^Gamma^', 
                                    '&Iota;'=>'^Iota^', 
                                    '&Kappa;'=>'^Kappa^', 
                                    '&Lambda;'=>'^Lambda^', 
                                    '&Mu;'=>'^Mu^', 
                                    '&Nu;'=>'^Nu^', 
                                    '&OElig;'=>'^OElig^', 
                                    '&Omega;'=>'^Omega^', 
                                    '&Omicron;'=>'^Omicron^',
                                    '&Phi;'=>'^Phi^', 
                                    '&Pi;'=>'^Pi^', 
                                    '&Prime;'=>'^Prime^', 
                                    '&Psi;'=>'^Psi^', 
                                    '&Rho;'=>'^Rho^', 
                                    '&Scaron;'=>'^Scaron^',
                                    '&Sigma;'=>'^Sigma^',
                                    '&Tau;'=>'^Tau^',
                                    '&Theta;'=>'^Theta^',
                                    '&Upsilon;'=>'^Upsilon^',
                                    '&Xi;'=>'^Xi^',
                                    '&Yuml;'=>'^Yuml^',
                                    '&Zeta;'=>'^Zeta^',
                                    '&alefsym;'=>'^alefsym^',
                                    '&alpha;'=>'^alpha^',
                                    '&and;'=>'^and^',
                                    '&ang;'=>'^ang^',
                                    '&asymp;'=>'^asymp^',
                                    '&bdquo;'=>'^bdquo^',
                                    '&beta;'=>'^beta^',
                                    '&bull;'=>'^bull^',
                                    '&cap;'=>'^cap^',
                                    '&chi;'=>'^chi^',
                                    '&circ;'=>'^circ^',
                                    '&clubs;'=>'^clubs^',
                                    '&cong;'=>'^cong^',
                                    '&crarr;'=>'^crarr^',
                                    '&cup;'=>'^cup^',
                                    '&dArr;'=>'^dArr^',
                                    '&dagger;'=>'^dagger^',
                                    '&darr;'=>'^darr^',
                                    '&delta;'=>'^delta^',
                                    '&diams;'=>'^diams^',
                                    '&empty;'=>'^empty^',
                                    '&emsp;'=>'^emsp^',
                                    '&ensp;'=>'^ensp^',
                                    '&epsilon;'=>'^epsilon^',
                                    '&equiv;'=>'^equiv^',
                                    '&eta;'=>'^eta^',
                                    '&euro;'=>'^euro^',
                                    '&exist;'=>'^exist^',
                                    '&fnof;'=>'^fnof^',
                                    '&forall;'=>'^forall^',
                                    '&frasl;'=>'^frasl^',
                                    '&gamma;'=>'^gamma^',
                                    '&ge;'=>'^ge^',
                                    '&hArr;'=>'^hArr^',
                                    '&harr;'=>'^harr^',
                                    '&hearts;'=>'^hearts^',
                                    '&hellip;'=>'^hellip^',
                                    '&image;'=>'^image^',
                                    '&infin;'=>'^infin^',
                                    '&int;'=>'^int^',
                                    '&iota;'=>'^iota^',
                                    '&isin;'=>'^isin^',
                                    '&kappa;'=>'^kappa^',
                                    '&lArr;'=>'^lArr^',
                                    '&lambda;'=>'^lambda^',
                                    '&lang;'=>'^lang^',
                                    '&larr;'=>'^larr^',
                                    '&lceil;'=>'^lceil^',
                                    '&ldquo;'=>'^ldquo^',
                                    '&le;'=>'^le^',
                                    '&lfloor;'=>'^lfloor^',
                                    '&lowast;'=>'^lowast^',
                                    '&loz;'=>'^loz^',
                                    '&lrm;'=>'^lrm^',
                                    '&lsaquo;'=>'^lsaquo^',
                                    '&lsquo;'=>'^lsquo^',
                                    '&mdash;'=>'^mdash^',
                                    '&minus;'=>'^minus^',
                                    '&mu;'=>'^mu^',
                                    '&nabla;'=>'^nabla^',
                                    '&ndash;'=>'^ndash^',
                                    '&ne;'=>'^ne^',
                                    '&ni;'=>'^ni^',
                                    '&notin;'=>'^notin^',
                                    '&nsub;'=>'^nsub^',
                                    '&nu;'=>'^nu^',
                                    '&oelig;'=>'^oelig^',
                                    '&oline;'=>'^oline^',
                                    '&omega;'=>'^omega^',
                                    '&omicron;'=>'^omicron^',
                                    '&oplus;'=>'^oplus^',
                                    '&or;'=>'^or^',
                                    '&otimes;'=>'^otimes^',
                                    '&part;'=>'^part^',
                                    '&permil;'=>'^permil^',
                                    '&perp;'=>'^perp^',
                                    '&phi;'=>'^phi^',
                                    '&pi;'=>'^pi^', 
                                    '&piv;'=>'^piv^',
                                    '&prime;'=>'^prime^',
                                    '&prod;'=>'^prod^',
                                    '&prop;'=>'^prop^',
                                    '&psi;'=>'^psi^',
                                    '&rArr;'=>'^rArr^',
                                    '&radic;'=>'^radic^',
                                    '&rang;'=>'^rang^',
                                    '&rarr;'=>'^rarr^',
                                    '&rceil;'=>'^rceil^',
                                    '&rdquo;'=>'^rdquo^',
                                    '&real;'=>'^real^',
                                    '&rfloor;'=>'^rfloor^',
                                    '&rho;'=>'^rho^',
                                    '&rlm;'=>'^rlm^',
                                    '&rsaquo;'=>'^rsaquo^',
                                    '&rsquo;'=>'^rsquo^',
                                    '&sbquo;'=>'^sbquo^',
                                    '&scaron;'=>'^scaron^',
                                    '&sdot;'=>'^sdot^',
                                    '&sigma;'=>'^sigma^',
                                    '&sigmaf;'=>'^sigmaf^',
                                    '&sim;'=>'^sim^',
                                    '&spades;'=>'^spades^',
                                    '&sub;'=>'^sub^',
                                    '&sube;'=>'^sube^',
                                    '&sum;'=>'^sum^',
                                    '&sup;'=>'^sup^',
                                    '&supe;'=>'^supe^',
                                    '&tau;'=>'^tau^',
                                    '&there4;'=>'^there4^',
                                    '&theta;'=>'^theta^',
                                    '&thetasym;'=>'^thetasym^',
                                    '&thinsp;'=>'^thinsp^',
                                    '&tilde;'=>'^tilde^',
                                    '&trade;'=>'^trade^',
                                    '&uArr;'=>'^uArr^',
                                    '&uarr;'=>'^uarr^',
                                    '&upsih;'=>'^upsih^',
                                    '&upsilon;'=>'^upsilon^',
                                    '&weierp;'=>'^weierp^',
                                    '&xi;'=>'^xi^',
                                    '&yuml;'=>'^yuml^',
                                    '&zeta;'=>'^zeta^',
                                    '&zwj;'=>'^zwj^',
                                    '&zwnj;'=>'^zwnj^');
    }
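
The table above covers only the entities it lists; it has no entry for &copy;, &nbsp; or numeric references such as &#38; from the question. A more general variant of the same placeholder idea is sketched below, assuming it is acceptable to match every entity reference with a regular expression instead of a fixed table. The names protectEntities(), restoreEntities(), absoluteSrcKeepingEntities() and the __ENT_..__ marker format are made up for this illustration and are not part of DOMDocument. The markers use only letters, digits and underscores so that serialization is unlikely to alter them.

```php
<?php
// Sketch only: all function names and the marker format are hypothetical.

// Replace each entity reference (&name;, &#38;, &#x26;) with an ASCII marker.
// The entity body is hex-encoded, so the marker never contains & or ;.
function protectEntities(string $html): string
{
    return preg_replace_callback(
        '/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/',
        function (array $m): string {
            return '__ENT_' . bin2hex($m[1]) . '__';
        },
        $html
    );
}

// Turn every marker back into the entity reference it was created from.
function restoreEntities(string $html): string
{
    return preg_replace_callback(
        '/__ENT_((?:[0-9a-f]{2})+)__/',
        function (array $m): string {
            return '&' . hex2bin($m[1]) . ';';
        },
        $html
    );
}

// Make img src attributes absolute while leaving entity references untouched.
function absoluteSrcKeepingEntities(string $html, string $base = 'http://my.server/'): string
{
    // Same charset workaround as in the question.
    $wrapped = '<html><head><meta http-equiv="Content-Type" '
             . 'content="text/html; charset=utf-8"></head><body>'
             . protectEntities(trim($html)) . '</body></html>';

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $loaded ? $dom->getElementsByTagName('body')->item(0) : null;
    if ($body === null) {
        return $html; // parsing failed: hand the input back unchanged
    }

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (!preg_match('#^https?://#i', $src)) {
            $img->setAttribute('src', $base . ltrim($src, '/'));
        }
    }

    // Serialize only the children of <body>, so no wrapper has to be cut off.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return restoreEntities($out);
}

$demo = absoluteSrcKeepingEntities(
    '<p><img src="images/lorem.jpg" alt="lorem"></p>'
    . '<center>&copy; Dr&nbsp;Jekyll &#38; Mr&nbsp;Hyde</center>'
);
```

Any text in the input that already looks like one of the markers would be turned into an entity on the way out, so the marker prefix should be something that cannot occur in real content.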
Terry answered Oct 09 '22 00:10


An alternative solution is to use DOMDocument->saveHTMLFile() (which doesn't convert HTML entities) and read the contents of the saved file back into a string.

It's not super pretty, but it has the advantage that you don't have to find and replace entity codes yourself (twice), as some of the other solutions proffered here do.
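
A minimal sketch of that round trip through a temporary file could look like the following. The helper name saveHtmlViaFile() is hypothetical, and the sketch only shows the mechanics: which entities actually survive is not verified here and may depend on the libxml version and the document's declared encoding.

```php
<?php
// Sketch only: saveHtmlViaFile() is a hypothetical helper name.

// Serialize the document with saveHTMLFile() and read the result back.
function saveHtmlViaFile(DOMDocument $dom): string
{
    $path = tempnam(sys_get_temp_dir(), 'dom');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    try {
        // saveHTMLFile() returns the number of bytes written, or false on error.
        if ($dom->saveHTMLFile($path) === false) {
            throw new RuntimeException('saveHTMLFile() failed');
        }
        $html = file_get_contents($path);
        if ($html === false) {
            throw new RuntimeException('Could not read the temporary file');
        }
        return $html;
    } finally {
        unlink($path); // always remove the temporary file
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><center>&copy; Dr&nbsp;Jekyll &#38; Mr&nbsp;Hyde</center></body></html>');
$roundTripped = saveHtmlViaFile($dom);
```

The returned string is the whole document, so the same trimming of the html and body wrappers as in the question would still be needed afterwards.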

Gavin Ballard answered Oct 09 '22 00:10