
How to clean a form field for an XML Attribute that will contain valid UTF8 characters?

I have been struggling with this for a little while. I have a multilingual web app that outputs XML at some point. This XML can contain text in any language, so my approach to sanitization has been to block only the specific characters that break XML. I also wrap as much as I can in CDATA, but a lot of my content lives in attributes. I don't want to disallow special characters wholesale, because perfectly valid characters such as parentheses, periods, dashes, ticks, and apostrophes are used all the time and work fine.

What is the best way to strip out every character that would break an XML attribute while leaving text in any language intact?

UPDATE:
I found http://en.wikipedia.org/wiki/CDATA#CDATA-type_attribute_value , which suggested that I could declare an attribute as a CDATA section using a DTD; however, that does not seem to be true.

<?xml version="1.0" ?> 
<!DOCTYPE foo [
  <!ELEMENT foo EMPTY>
  <!ATTLIST foo a CDATA #REQUIRED>
]>
<foo a="&bull;"><![CDATA[ &bull; ]]> </foo>

Any validator will complain that bull is not a defined entity in the attribute. If you remove the attribute, it will be valid. Also, I hear schemas are the way to go, so if something like the above is possible with an XML Schema instead, that would be awesome.

Thanks!

Parris, asked May 24 '12 18:05


1 Answer

This is well-formed, because the ampersand in the attribute is escaped as &amp;:

<?xml version="1.0" ?> 
<!DOCTYPE foo [
  <!ELEMENT foo EMPTY>
  <!ATTLIST foo a CDATA #REQUIRED>
]>
<foo a="&amp;bull;"><![CDATA[ &bull; ]]> </foo>

You can translate special characters to HTML entities with

htmlentities($str);

and reverse the translation with

html_entity_decode($str);
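A minimal round-trip sketch of the two calls, assuming a UTF-8 source file and a made-up sample string (`$str`'s value is not from the question). One caveat: `htmlentities()` emits HTML named entities such as `&bull;`, and XML itself predefines only `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`, so this output is only safe inside XML if the other entities are declared in the DTD.

```php
<?php
// Hypothetical sample value containing a bullet and an ampersand.
$str = "• Tom & Jerry";

// Convert every character that has an HTML named entity.
// The default document type (HTML 4.01) is used here.
$encoded = htmlentities($str, ENT_QUOTES, 'UTF-8');

// Reverse the conversion with the same flags and charset.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $encoded, PHP_EOL;        // the bullet becomes &bull;, the ampersand &amp;
var_dump($decoded === $str);   // the round trip restores the original string
```

Passing the charset explicitly keeps the behaviour independent of the `default_charset` ini setting.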

See: http://www.php.net/manual/en/function.htmlentities.php

See also "HTML metacharacters".
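For the attribute case in the question, a sketch along these lines may fit better than converting to HTML entities. It uses `htmlspecialchars()` with the `ENT_XML1` flag, which escapes only the XML metacharacters and leaves all other UTF-8 text untouched. The helper name `xml_attr_escape` and the sample value are made up for illustration, and the regular expression is my reading of the `Char` production in the XML 1.0 specification, so treat it as an assumption to check against your own data.

```php
<?php
// Hypothetical helper; not part of PHP or of the original answer.
function xml_attr_escape(string $value): string
{
    // Remove code points that XML 1.0 does not allow at all
    // (most C0 control characters, U+FFFE and U+FFFF).
    // preg_replace() returns null if $value is not valid UTF-8.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    ) ?? '';

    // Escape only & < > " ' and leave every other character as-is.
    return htmlspecialchars($clean, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Sample value mixing non-ASCII text with XML metacharacters.
$title = "• Ünïcödé \"quoted\" & <tagged>";
echo '<foo a="' . xml_attr_escape($title) . '"/>', PHP_EOL;
```

With this approach parentheses, periods, dashes and non-Latin scripts pass through unchanged, so nothing has to be blacklisted by hand. Building the document with `DOMDocument` and `setAttribute()` is another option, since the library then handles attribute escaping itself.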

neu-rah, answered Oct 26 '22 23:10