Given a DOMDocument constructed with a stylesheet that contains an emoji character, I've found some strange behavior when serializing the DOM back out to HTML.
$html = <<< HTML
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<style>span::before{ content: "⚡️"; }</style>
</head>
<body>
<span></span>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom->saveHTML($dom->documentElement);
echo $dom->saveHTML();
The result of $dom->saveHTML($dom->documentElement) is (as desired):
<html><head><meta charset="utf-8">
<style>span::before{ content: "⚡️"; }</style>
</head><body><span></span></body></html>
But $dom->saveHTML() returns (erroneously):
<html><head><meta charset="utf-8">
<style>span::before{ content: "&#9889;&#65039;"; }</style>
</head><body><span></span></body></html>
Notice how the emoji “⚡️” is encoded as the HTML entities &#9889;&#65039; inside the stylesheet. Browsers do not decode HTML entities inside a <style> element, so the result is treated as a literal string; the CSS escape \26A1 would have to be used instead.
I tried setting $dom->substituteEntities = false, but it had no effect.
The same HTML entity conversion is also happening inside of <script> elements, which causes similar problems in browsers.
Test via online PHP shell: https://3v4l.org/jMfDd
You should convert the encoding before loading HTML that contains emojis into DOMDocument:
$dom->loadHTML(mb_convert_encoding($htmlCode, 'HTML-ENTITIES', 'UTF-8'));
EDIT: As mentioned by the post owner, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2 (tested on 8.2.5, where it still works but raises a deprecation notice). For later versions of PHP, take a look at https://php.watch/versions/8.2/mbstring-qprint-base64-uuencode-html-entities-deprecated#html
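One possible replacement for the deprecated 'HTML-ENTITIES' conversion is mb_encode_numericentity, sketched below. The helper name encodeNonAsciiAsEntities and the sample $htmlCode value are assumptions made for illustration and are not part of the original answer; the conversion map simply covers every code point from U+0080 upward.

```php
<?php
// Hypothetical helper (not from the original answer): converts every
// non-ASCII character of a UTF-8 string into a decimal numeric entity.
function encodeNonAsciiAsEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// Sample input, assumed for illustration only.
$htmlCode = '<p>⚡️</p>';

$dom = new DOMDocument();
$dom->loadHTML(encodeNonAsciiAsEntities($htmlCode));
echo $dom->saveHTML();
```

Like the mb_convert_encoding call above, this rewrites every non-ASCII character in the string before parsing, including any that appear inside <style> or <script> elements, so it may not be enough on its own when those elements contain emojis.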