Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Crunching xml with python

Tags:

python

xml

I need to remove white spaces between xml tags, e.g. if the original xml looks like:

<node1>
    <node2>
        <node3>foo</node3>
    </node2>
</node1>

I'd like the end-result to be crunched down to single line:

<node1><node2><node3>foo</node3></node2></node1>

Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.

I have couple of ideas so far: (1) parse the xml as text and look for start and end of tags < and > (2) another approach is to load the xml document and go node-by-node and print out a new document by concatenating the tags.

I think either method would work, but I'd rather not reinvent the wheel here, so may be there is a python library that already does something like this? If not, then any issues/pitfalls to be aware of when rolling out my own cruncher? Any recommendations?

EDIT Thank you all for answers/suggestions, both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.

like image 522
Sergey Golovchenko Avatar asked Mar 20 '09 18:03

Sergey Golovchenko


2 Answers

This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)

foo = """<node1>
    <node2>
        <node3>foo  </node3>
    </node2>
</node1>"""

bar = etree.XML(foo, parser)
print etree.tostring(bar,pretty_print=False,with_tail=True)

Results in:

<node1><node2><node3>foo  </node3></node2></node1>

Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:

parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
like image 168
Van Gale Avatar answered Sep 28 '22 07:09

Van Gale


I'd use XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

That should do the trick.

In python you could use lxml (direct link to sample on homepage) to transform it.

For some tests, use xsltproc, sample:

xsltproc test.xsl  test.xml

where test.xsl is the file above and test.xml your XML file.

like image 28
Johannes Weiss Avatar answered Sep 28 '22 07:09

Johannes Weiss