
Why does DOMDocument::saveHTML()'s behavior differ in encoding UTF-8 as entities in style & script elements?

Given a DOMDocument constructed with a stylesheet that contains an emoji character, I've found some strange behavior when serializing the DOM back out to HTML.

$html = <<< HTML
<!DOCTYPE html>
<html>
<head>
  <meta charset=utf-8>
  <style>span::before{ content: "⚡️"; }</style>
</head>
<body>
  <span></span>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);

// Serialize starting from the root element.
echo $dom->saveHTML($dom->documentElement);
// Serialize the whole document.
echo $dom->saveHTML();

The result of $dom->saveHTML($dom->documentElement) is (as desired):

<html><head><meta charset="utf-8">
<style>span::before{ content: "⚡️"; }</style>
</head><body><span></span></body></html>

But $dom->saveHTML() returns (erroneously):

<html><head><meta charset="utf-8">
<style>span::before{ content: "&#9889;&#65039;"; }</style>
</head><body><span></span></body></html>

Notice how the emoji “⚡️” is encoded as the HTML entities &#9889;&#65039; inside the stylesheet. Browsers do not decode character references inside a style element, so the sequence is treated as a literal string; in CSS, an escape such as \26A1 would have to be used instead.

I tried setting $dom->substituteEntities = false, but it had no effect.

The same HTML entity conversion is also happening inside of <script> elements, which causes similar problems in browsers.
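
Since serializing from the document element leaves the characters intact, one possible stopgap is to serialize that way and prepend the doctype by hand. This is only a minimal sketch, assuming the behavior shown above holds on the PHP/libxml2 build in use; the helper name saveHtmlKeepingUtf8 is hypothetical and the doctype string is hard-coded rather than read from the document:

```php
<?php
// Hypothetical helper: serialize from the root element so that text inside
// <style> and <script> is not rewritten as numeric entities, then prepend
// a hard-coded HTML5 doctype (an assumption; adjust for other doctypes).
function saveHtmlKeepingUtf8(DOMDocument $dom): string
{
    return "<!DOCTYPE html>\n" . $dom->saveHTML($dom->documentElement);
}

$html = <<<HTML
<!DOCTYPE html>
<html>
<head>
  <meta charset=utf-8>
  <style>span::before{ content: "⚡️"; }</style>
</head>
<body>
  <span></span>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

echo saveHtmlKeepingUtf8($dom);
```

This keeps the stylesheet text as parsed, but it does not explain why the two saveHTML() call forms differ, which is the actual question.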

Test via online PHP shell: https://3v4l.org/jMfDd

Weston Ruter asked Nov 30 '25 19:11



1 Answer

You should convert the UTF-8 input to HTML entities before loading HTML that contains emoji into DOMDocument:

$dom->loadHTML(mb_convert_encoding($htmlCode, 'HTML-ENTITIES', 'UTF-8'));

EDIT: As mentioned by the post owner, calling mb_convert_encoding with the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2 (tested on 8.2.5, where it still works fine). For later versions of PHP, take a look at https://php.watch/versions/8.2/mbstring-qprint-base64-uuencode-html-entities-deprecated#html
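
For PHP versions where the 'HTML-ENTITIES' conversion is unavailable, one commonly used substitute is mb_encode_numericentity(). The following is a rough sketch, not a drop-in guarantee: the helper name encodeNonAsciiAsEntities and the sample markup are made up for illustration, and it assumes the mbstring extension is loaded and the input is valid UTF-8:

```php
<?php
// Hypothetical helper: rewrite every non-ASCII code point as a numeric
// character reference while leaving ASCII (including markup) untouched.
function encodeNonAsciiAsEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// Illustrative input only; "\u{26A1}" is the high-voltage sign.
$htmlCode = "<p>Zap: \u{26A1}</p>";

$dom = new DOMDocument();
$dom->loadHTML(encodeNonAsciiAsEntities($htmlCode));
```

The map restricts the conversion to code points from U+0080 upward, so tags and attributes pass through unchanged.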

Mariano Argañaraz answered Dec 02 '25 10:12
