 

Is htmlentities() sufficient for creating xml-safe values?

I'm building an XML file from scratch and need to know whether htmlentities() converts every character that could potentially break an XML file (and possibly UTF-8 data).

The values will come from a Twitter/Flickr feed, so I need to be sure.

John Himmelman asked May 12 '10 21:05




2 Answers

htmlentities() is not a guaranteed way to build legal XML.

If characters that break XML markup are all you are worried about, use htmlspecialchars() instead of htmlentities(). If there are encoding mismatches between the representation of your data and the encoding of your XML document, htmlentities() may work around them or cover them up (and it will bloat your XML in doing so). I believe it's better to get your encodings consistent and just use htmlspecialchars().
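To illustrate the difference, here is a minimal sketch. The sample string is made up and is assumed to be valid UTF-8. htmlentities() emits HTML named entities such as &eacute;, whereas XML itself only predefines &amp;, &lt;, &gt;, &quot; and &apos;; htmlspecialchars() touches only the markup-significant characters.

```php
<?php
// Sample input, assumed to be valid UTF-8: "café & <tea>".
$text = "caf\xC3\xA9 & <tea>";

// htmlentities() also converts é to the HTML named entity &eacute;.
$viaEntities = htmlentities($text, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only converts &, <, >, " and (with ENT_QUOTES) '.
$viaSpecial = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Wrap each result in an element and ask an XML parser whether it accepts it.
libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$doc = new DOMDocument();
$entitiesOk = $doc->loadXML("<v>$viaEntities</v>");
$specialOk  = $doc->loadXML("<v>$viaSpecial</v>");
libxml_clear_errors();

var_dump($viaEntities, $viaSpecial, $entitiesOk, $specialOk);
```

Without a DTD that declares eacute, the parser should reject the first document and accept the second.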

Also, be aware that if you put the return value of htmlspecialchars() inside XML attributes delimited with single quotes, you will need to pass the ENT_QUOTES flag so that any single quotes in your source string are encoded too. I suggest doing this anyway, as it makes your code immune to bugs caused by someone switching to single-quoted XML attributes in the future.
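A minimal sketch of that advice, assuming UTF-8 input. The helper name xml_escape() and the sample caption are hypothetical; ENT_XML1 is an additional flag (PHP 5.4+) that was not mentioned above and is used here so that the apostrophe is written as &apos;.

```php
<?php
/**
 * Hypothetical helper (not from the answer): escape a UTF-8 string for use
 * in XML text or in an attribute delimited by either quote style.
 */
function xml_escape(string $value): string
{
    // ENT_QUOTES converts both " and '; ENT_XML1 selects XML 1.0 entity names.
    return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$caption = "Tom's \"best\" shots <2010> & more";

// Single-quoted attribute: safe only because the apostrophe was escaped.
$xml = "<photo caption='" . xml_escape($caption) . "'/>";

echo $xml, "\n";
```

Leaving out ENT_QUOTES here would let the apostrophe in "Tom's" terminate the attribute early.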

Edit: To clarify:

htmlentities() will convert a number of non-ASCII characters (I assume this is what you mean by UTF-8 data) to entities, which are written using only ASCII characters. However, it cannot do so for characters that have no corresponding entity, so it cannot guarantee that its return value consists only of ASCII characters. That's why I'm suggesting not to use it.

If encoding is a possible issue, handle it explicitly (e.g. with iconv()).
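As a rough sketch of handling the encoding explicitly: convert to UTF-8 first, then escape. The source charset (ISO-8859-1) and the sample bytes are assumptions made for illustration; substitute whatever encoding the real feed delivers.

```php
<?php
// Assumption for this sketch: the incoming bytes are ISO-8859-1.
$raw = "caf\xE9 & cr\xE8me";   // "café & crème" as ISO-8859-1 bytes

// Convert explicitly instead of letting entities paper over the mismatch.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
if ($utf8 === false) {
    throw new RuntimeException('Input could not be converted to UTF-8');
}

// Now only the markup-significant characters need escaping.
$escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n" . "<note>$escaped</note>";
echo $xml, "\n";
```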

Edit 2: Improved the answer to take Josh Davis's comment below into account.

Jon answered Sep 28 '22 14:09


DOMDocument::createTextNode() will automatically escape your content.

Example:

    $dom = new DOMDocument;
    $element = $dom->createElement('Element');
    $element->appendChild(
        $dom->createTextNode('I am text with Ünicödé & HTML €ntities ©')
    );
    $dom->appendChild($element);
    echo $dom->saveXml();

Output:

    <?xml version="1.0"?>
    <Element>I am text with &#xDC;nic&#xF6;d&#xE9; &amp; HTML &#x20AC;ntities &#xA9;</Element>

When you set the internal encoding to utf-8, e.g.

$dom->encoding = 'utf-8'; 

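you'll still get escaped markup characters, but the non-ASCII characters are now written literally instead of as numeric character references:

    <?xml version="1.0" encoding="utf-8"?>
    <Element>I am text with Ünicödé &amp; HTML €ntities ©</Element>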

Note that the above is not the same as passing the text as the second argument ($value) of DOMDocument::createElement(). That method only makes sure your element names are valid; it does not escape the value. See the Notes on the manual page. For example,

    $dom = new DOMDocument;
    $element = $dom->createElement('Element', 'I am text with Ünicödé & HTML €ntities ©');
    $dom->appendChild($element);
    $dom->encoding = 'utf-8';
    echo $dom->saveXml();

will result in a Warning

Warning: DOMDocument::createElement(): unterminated entity reference  HTML €ntities © 

and the following output:

    <?xml version="1.0" encoding="utf-8"?>
    <Element>I am text with Ünicödé </Element>
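One way to avoid the truncated output, sketched under the assumption that all strings are UTF-8: create the element without a value and attach the content as a text node. The helper name appendTextElement(), the Feed root element and the source attribute are hypothetical and only serve the illustration.

```php
<?php
/**
 * Hypothetical helper: append <$name> to $parent with $text as character data.
 */
function appendTextElement(DOMDocument $dom, DOMNode $parent, string $name, string $text): DOMElement
{
    $element = $dom->createElement($name);              // name only, no $value argument
    $element->appendChild($dom->createTextNode($text)); // text node is escaped on output
    $parent->appendChild($element);
    return $element;
}

$dom = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($dom->createElement('Feed'));

$item = appendTextElement($dom, $root, 'Element', 'I am text with Ünicödé & HTML €ntities ©');
$item->setAttribute('source', 'Tom\'s "feed"');         // attribute values are escaped too

$xml = $dom->saveXML();
echo $xml;
```

Because the DOM extension serializes the text node and the attribute itself, no manual call to htmlspecialchars() is needed in this variant.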
Gordon answered Sep 28 '22 14:09