
Java: remove CDATA tag from XML

XPath is nice for parsing XML files, but it's not working for data inside a CDATA section:

<![CDATA[ Some Text <p>more text and tags</p>... ]]>

My solution: get the content of the XML first and remove

"<![CDATA["  and  "]]>".

After that I would run XPath against the result, so that it can reach everything in the XML file. Is there a better solution? If not, how can I do it with a regular expression?

SandyBr asked Jul 26 '11 21:07


2 Answers

The CDATA section is there because everything inside it is plain character data, not markup that should be interpreted as XML. The document fragment in your question could equivalently be written as

 Some Text &lt;p&gt;more text and tags&lt;/p&gt;... 

(with a leading and trailing space).

If you really want to interpret this as XML, extract the text from your document, and submit it to an XML parser again.
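
A minimal sketch of that approach in Java, using only the JDK's built-in DOM and XPath APIs. The class and method names (`CdataReparse`, `evaluateInside`) and the synthetic `<wrapper>` element are assumptions made for illustration; the wrapper is there because the extracted text is a fragment without a single root element. This only works if the CDATA content happens to be well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataReparse {

    // Parses a string as an XML document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Reads the string value of the element selected by outerPath (CDATA content
    // arrives as plain text), wraps it in a hypothetical <wrapper> root element,
    // parses that as a second document and evaluates innerPath against it.
    static String evaluateInside(String xml, String outerPath, String innerPath) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            String embedded = xpath.evaluate(outerPath, parse(xml));
            Document inner = parse("<wrapper>" + embedded + "</wrapper>");
            return xpath.evaluate(innerPath, inner);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the embedded markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<child>Here is some text.</child>"
                + "<child><![CDATA[Here is more text <p>with tags</p>.]]></child>"
                + "</root>";
        // Prints the text of the <p> element that was hidden inside the CDATA section.
        System.out.println(evaluateInside(xml, "/root/child[2]", "/wrapper/p"));
    }
}
```

Compared with stripping the markers by string replacement, this leaves the outer document untouched and only re-parses the one text value you actually need.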

Paŭlo Ebermann answered Sep 17 '22 14:09


To strip the CDATA and keep the tags as tags, you could use XSLT.

Given this XML input:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
    <child>Here is some text.</child>
    <child><![CDATA[Here is more text <p>with tags</p>.]]></child>
</root>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

The transformation returns the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <child>Here is some text.</child>
   <child>Here is more text <p>with tags</p>.</child>
</root>

(Tested with Saxon HE 9.3.0.5 in oXygen 12.2)

Then you could use XPath to extract the contents of the p element:

/root/child/p
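
To run both steps from Java, something like the following sketch might work. It relies on the JDK's built-in `javax.xml.transform` processor, which implements XSLT 1.0 only, so the stylesheet is embedded as a 1.0 adaptation of the one above (same template, `version` changed, `indent` and the unused `xs` namespace dropped). The class and method names (`CdataUnwrap`, `unwrap`, `evaluate`) are hypothetical. Keep in mind that `disable-output-escaping` is an optional feature that a processor may ignore, and that it only takes effect when the result is serialized, which is why the output goes to a stream here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataUnwrap {

    // XSLT 1.0 adaptation of the stylesheet above (assumption: a 1.0 processor is used).
    // In XSLT 1.0, value-of emits only the first text node of each element, and
    // xsl:copy on an element does not copy its attributes.
    static final String STYLESHEET =
            "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
            + "<xsl:output method='xml'/>"
            + "<xsl:strip-space elements='*'/>"
            + "<xsl:template match='*'>"
            + "<xsl:copy>"
            + "<xsl:apply-templates select='*'/>"
            + "<xsl:value-of select='text()' disable-output-escaping='yes'/>"
            + "</xsl:copy>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet and returns the serialized result as a string.
    static String unwrap(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    // Parses the given XML string and evaluates an XPath expression against it.
    static String evaluate(String xml, String path) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(path, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Result is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<child>Here is some text.</child>"
                + "<child><![CDATA[Here is more text <p>with tags</p>.]]></child>"
                + "</root>";
        String unwrapped = unwrap(xml);
        System.out.println(unwrapped);
        System.out.println(evaluate(unwrapped, "/root/child/p"));
    }
}
```

If the CDATA content is not well-formed XML, the second parse in `evaluate` fails, so the sketch surfaces that as an exception instead of returning partial data.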
james.garriss answered Sep 18 '22 14:09