I need to remove white spaces between xml tags, e.g. if the original xml looks like:
<node1>
<node2>
<node3>foo</node3>
</node2>
</node1>
I'd like the end-result to be crunched down to single line:
<node1><node2><node3>foo</node3></node2></node1>
Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.
I have couple of ideas so far: (1) parse the xml as text and look for start and end of tags < and > (2) another approach is to load the xml document and go node-by-node and print out a new document by concatenating the tags.
I think either method would work, but I'd rather not reinvent the wheel here, so may be there is a python library that already does something like this? If not, then any issues/pitfalls to be aware of when rolling out my own cruncher? Any recommendations?
EDIT Thank you all for answers/suggestions, both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.
This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
foo = """<node1>
<node2>
<node3>foo </node3>
</node2>
</node1>"""
bar = etree.XML(foo, parser)
print etree.tostring(bar,pretty_print=False,with_tail=True)
Results in:
<node1><node2><node3>foo </node3></node2></node1>
Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
I'd use XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
That should do the trick.
In python you could use lxml (direct link to sample on homepage) to transform it.
For some tests, use xsltproc
, sample:
xsltproc test.xsl test.xml
where test.xsl
is the file above and test.xml
your XML file.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With