In short, is it possible to use a DTD to define an element as containing CDATA?
I'm calling a third-party API that returns some invalid content inside an element. Specifically, the data contains HTML entities such as &rsquo;. When I try to parse this XML with SimpleXML, I of course get the parser error "Entity 'rsquo' not defined". Here's a simplified example of the structure I'm dealing with:
<items>
  <item>
    <name>Jim Smith</name>
    <description>Jim&rsquo;s description breaks my parser</description>
  </item>
</items>
Since I can't fix the API response, I've resorted to a dirty trick: injecting a CDATA section into the problem element just before I parse it:
$xml = str_replace("<description>", "<description><![CDATA[", $xml);
$xml = str_replace("</description>", "]]></description>", $xml);
This fixes the issue for me, but I suspect the overhead is too high. The XML can be anywhere from 30K to 100K of data.
I'd rather use a DTD, but for the life of me I can't find any spec that allows an element to be declared as CDATA (in the same way I can declare PCDATA). Below is what I'd like to do, but of course it's invalid because of the '#CDATA' declaration:
<!DOCTYPE ITEMS [
  <!ELEMENT ITEMS (ITEM)>
  <!ELEMENT ITEM (NAME, DESCRIPTION)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT DESCRIPTION (#CDATA)>
]>
Thanks for any insights!
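It is possible in SGML DTDs (e.g. the HTML 4.01 script element, whose content is declared as CDATA), but not in XML DTDs, where CDATA exists only as an attribute type and character content inside an element is always parsed (#PCDATA). Hence the change to the script element in XHTML 1.0.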
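If the real goal is only to get past the "Entity 'rsquo' not defined" error, a DTD can still help in a different way: an internal subset can declare the missing entity. The fragment below is a sketch under assumptions the question does not confirm: that the API response carries no DOCTYPE of its own, that this declaration is prepended to the response before parsing (after the XML declaration, if there is one), and that `item*` reflects the real structure. Element names are lower case to match the sample, since XML names are case-sensitive. Whether the parsed result contains the expanded character or an entity-reference node may depend on parser options.

```xml
<!DOCTYPE items [
  <!-- Assumption: zero or more item elements; adjust to the real feed. -->
  <!ELEMENT items (item*)>
  <!ELEMENT item (name, description)>
  <!ELEMENT name (#PCDATA)>
  <!-- XML has no #CDATA content type; text content is declared as #PCDATA. -->
  <!ELEMENT description (#PCDATA)>
  <!-- Declare the HTML entity so the parser can resolve &rsquo; (U+2019). -->
  <!ENTITY rsquo "&#8217;">
]>
```

Every other HTML entity the API emits would need its own ENTITY declaration, so this only suits a small, known set. The CDATA injection from the question remains the more general workaround.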