Crunching xml with python

Question

I need to remove whitespace between xml tags, e.g. if the original xml looks like:

<node1>
    <node2>
        <node3>foo</node3>
    </node2>
</node1>

I'd like the end result to be crunched down to a single line:

<node1><node2><node3>foo</node3></node2></node1>

Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.

I have a couple of ideas so far: (1) parse the xml as text and look for the start and end of tags, < and >; (2) load the xml document, go node-by-node, and print out a new document by concatenating the tags.

I think either method would work, but I'd rather not reinvent the wheel here, so maybe there is a python library that already does something like this? If not, are there any issues/pitfalls to be aware of when rolling my own cruncher? Any recommendations?

EDIT: Thank you all for the answers and suggestions. Both Triptych's and Van Gale's solutions work for me and do exactly what I want. I wish I could accept both answers.

Van Gale · Accepted Answer

This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)

foo = """<node1>
    <node2>
        <node3>foo  </node3>
    </node2>
</node1>"""

bar = etree.XML(foo, parser)
# tostring() returns bytes on Python 3, so decode before printing
print(etree.tostring(bar, pretty_print=False, with_tail=True).decode())

Results in:

<node1><node2><node3>foo  </node3></node2></node1>

Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:

parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
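Putting the two parser options together, here is a minimal sketch for Python 3. The helper name crunch_xml and the node4 element with its CDATA payload are made up for illustration and are not part of the original answer; the idea is just to check that the CDATA block survives while the whitespace between tags goes away.

```python
from lxml import etree


def crunch_xml(xml_text):
    """Return xml_text serialized without whitespace-only text between tags.

    Hypothetical helper for illustration; CDATA sections are kept as CDATA.
    """
    parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
    root = etree.fromstring(xml_text, parser)
    # encoding="unicode" makes tostring() return str instead of bytes
    return etree.tostring(root, encoding="unicode")


sample = """<node1>
    <node2>
        <node3>foo  </node3>
        <node4><![CDATA[  <keep>  me  </keep>  ]]></node4>
    </node2>
</node1>"""

crunched = crunch_xml(sample)
print(crunched)
```

Note that remove_blank_text only drops text nodes that consist entirely of whitespace, so the trailing spaces in "foo  " are left alone, as in the result above.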

Johannes Weiss · Answer

I'd use XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

That should do the trick.

In Python you could use lxml to apply the stylesheet (the lxml homepage has a sample).

For quick tests, use xsltproc, for example:

xsltproc test.xsl  test.xml

where test.xsl is the file above and test.xml your XML file.
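A rough sketch of the lxml route, assuming the stylesheet above is embedded as a string rather than read from test.xsl. The variable names and the sample document (including the id attribute) are only for illustration.

```python
from lxml import etree

# The stylesheet from above, embedded as a string for the example
XSLT_TEXT = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>"""

# Illustrative input; the id attribute checks that attributes are copied
sample = """<node1>
    <node2 id="a">
        <node3>foo</node3>
    </node2>
</node1>"""

transform = etree.XSLT(etree.XML(XSLT_TEXT))
result = transform(etree.XML(sample))

# The serialized result may end with a newline, so strip it
crunched = str(result).strip()
print(crunched)
```

One caveat: XSLT has no notion of CDATA sections in its data model, so with this stylesheet CDATA content comes out as ordinary escaped text, and comments and processing instructions are dropped because the template only matches elements.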

Tags: python, xml

Sergey Golovchenko
