I have been struggling with this for a little while. I have a multilingual web app that outputs XML at some point. This XML can contain text in any language, so my approach to sanitization has been to disallow the specific characters that break XML. I also wrap as much as I can in CDATA, but a lot of my content lives in attributes. I don't want to disallow special characters wholesale, because perfectly valid characters such as parentheses, periods, dashes, ticks and apostrophes are used all the time and they work.
What is the best way to strip out all characters that will break an XML attribute while leaving text in any language intact?
UPDATE:
I found: http://en.wikipedia.org/wiki/CDATA#CDATA-type_attribute_value , which suggested to me that I can declare an attribute as CDATA using a DTD and have it behave like a CDATA section; however, that does not seem to be true.
<?xml version="1.0" ?>
<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ATTLIST foo a CDATA #REQUIRED>
]>
<foo a="&bull;"><![CDATA[ • ]]> </foo>
Any validator will complain that bull is not a declared entity in the attribute. If you remove the attribute, it will be valid. I have also heard that schemas are the way to go, so if something like the above is possible with an XML Schema instead, that would be awesome.
Thanks!
This is valid:
<?xml version="1.0" ?>
<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ATTLIST foo a CDATA #REQUIRED>
]>
<foo a="&amp;bull;"><![CDATA[ • ]]> </foo>
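For anyone who wants to check the difference quickly, here is a minimal sketch in Python (my choice of language, the thread does not name one) using the standard library's xml.etree.ElementTree. The is_well_formed helper is a hypothetical name, and the DOCTYPE is left out to keep the check short, so this only tests well-formedness, not validity against the DTD:

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Named HTML entity that XML does not predefine: rejected.
print(is_well_formed('<foo a="&bull;"/>'))
# Escaped ampersand, so the attribute holds the literal text "&bull;": accepted.
print(is_well_formed('<foo a="&amp;bull;"/>'))
# Numeric character reference for the bullet: accepted.
print(is_well_formed('<foo a="&#8226;"/>'))
# The literal character itself: accepted.
print(is_well_formed('<foo a="•"/>'))
```

The literal bullet and the numeric reference both give the attribute the value "•", so a non-ASCII character on its own is not what breaks the attribute.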
You can translate special characters to HTML entities with
htmlentities($str);
and reverse the translation with
html_entity_decode($str);
See: http://www.php.net/manual/en/function.htmlentities.php
See also "HTML metacharacters".
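One caveat: htmlentities produces HTML named entities such as &bull;, while XML itself predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so its output can run into the same undeclared-entity complaint. A minimal sketch of an alternative, assuming Python rather than PHP and the standard library's xml.sax.saxutils.quoteattr: escape only the XML metacharacters, drop the control characters that XML 1.0 forbids, and leave every other character alone. The names xml_attr and _ILLEGAL_XML_CHARS are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Characters XML 1.0 does not allow at all: C0 controls other than
# tab, LF and CR, plus U+FFFE and U+FFFF. (Lone surrogates are also
# illegal but are not handled in this sketch.)
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def xml_attr(value: str) -> str:
    """Return value as a quoted, escaped XML attribute literal."""
    cleaned = _ILLEGAL_XML_CHARS.sub("", value)
    # quoteattr escapes &, < and >, picks a safe quote style, and
    # returns the value already wrapped in quotes.
    return quoteattr(cleaned)


raw = 'Tom & "Jerry" <café> • 日本語\x00'
doc = "<foo a=%s/>" % xml_attr(raw)

# Round trip: the parser gives back the original text minus the NUL.
parsed = ET.fromstring(doc)
print(parsed.get("a"))
```

Nothing here touches the bullet, the accented letter or the Japanese text; only the markup characters and the stray NUL are affected. In PHP, htmlspecialchars is the closer equivalent, since it escapes only the markup characters (quotes depending on the flags passed).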