The easiest solution is probably using lxml, where you can set a parser option to ignore white space between elements:
>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> xml_str = '''<root>
... <head></head>
... <content></content>
... </root>'''
>>> elem = etree.XML(xml_str, parser=parser)
>>> print(etree.tostring(elem, encoding='unicode'))
<root><head/><content/></root>
This will probably be enough for your needs, but some warnings to be on the safe side:
This will just remove whitespace nodes between elements, and try not to remove whitespace nodes inside elements with mixed content:
>>> elem = etree.XML('<p> spam <a>ham</a> <a>eggs</a></p>', parser=parser)
>>> print(etree.tostring(elem, encoding='unicode'))
<p> spam <a>ham</a> <a>eggs</a></p>
Leading or trailing whitespace in text nodes will not be removed. In some circumstances, however, it will still remove whitespace nodes from mixed content: namely, when the parser has not yet encountered a non-whitespace text node at that level.
>>> elem = etree.XML('<p><a> ham</a> <a>eggs</a></p>', parser=parser)
>>> print(etree.tostring(elem, encoding='unicode'))
<p><a> ham</a><a>eggs</a></p>
If you don't want that, you can use xml:space="preserve", which will be respected. Another option is to use a DTD together with etree.XMLParser(load_dtd=True), where the parser uses the DTD to determine which whitespace nodes are significant.
Other than that, you will have to write your own code to remove the whitespace you don't want (iterate over the descendants and, where appropriate, set .text and .tail properties that contain only whitespace to None or an empty string).
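As a rough sketch of that last approach: the snippet below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so it runs without extra packages; both expose the same .text and .tail properties). The helper name strip_blank_text is made up for illustration. It blanks every whitespace-only .text and .tail, so unlike remove_blank_text it makes no attempt to spare mixed content or honour xml:space.

```python
import xml.etree.ElementTree as ET


def strip_blank_text(root):
    """Set whitespace-only .text and .tail values to None, in place.

    Hypothetical helper: it does not detect mixed content and does not
    look at xml:space, so it can remove whitespace that matters.
    """
    for elem in root.iter():
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None
    return root


root = ET.fromstring("<root>\n  <head></head>\n  <content> kept </content>\n</root>")
strip_blank_text(root)
print(ET.tostring(root, encoding="unicode"))
```

Text that contains anything other than whitespace, such as " kept " above, is left exactly as it was, including its leading and trailing spaces.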
Here's something quick I came up with because I didn't want to use lxml:
from xml.dom import minidom
from xml.dom.minidom import Node

def remove_blanks(node):
    # Strip leading/trailing whitespace from every text node, recursing into elements.
    for x in node.childNodes:
        if x.nodeType == Node.TEXT_NODE:
            if x.nodeValue:
                x.nodeValue = x.nodeValue.strip()
        elif x.nodeType == Node.ELEMENT_NODE:
            remove_blanks(x)

xml = minidom.parse('file.xml')
remove_blanks(xml)
xml.normalize()  # merges adjacent text nodes and drops the now-empty ones
with open('file.xml', 'w') as result:
    result.write(xml.toprettyxml(indent=' '))
I really only needed this to re-indent an XML file whose indentation was otherwise broken. It doesn't respect the xml:space="preserve" directive, but, honestly, neither does a lot of other software that deals with XML, so it's rather a funny requirement :) You could also easily add that sort of functionality to the code above (just check for the xml:space attribute, and don't recurse if its value is 'preserve').
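One possible way to add that check, as a sketch only: the function name remove_blanks_preserving and the sample document are made up for illustration, and the sketch assumes it is enough to skip a whole subtree as soon as xml:space="preserve" is seen (it does not handle a nested xml:space="default" switching stripping back on).

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def remove_blanks_preserving(node):
    """Like remove_blanks, but leaves xml:space="preserve" subtrees alone."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            if child.nodeValue:
                child.nodeValue = child.nodeValue.strip()
        elif child.nodeType == Node.ELEMENT_NODE:
            # getAttribute returns '' when the attribute is absent.
            if child.getAttribute('xml:space') != 'preserve':
                remove_blanks_preserving(child)


# Hypothetical sample input, not taken from the original post.
doc = minidom.parseString(
    '<root>\n  <a> x </a>\n  <pre xml:space="preserve">  keep  </pre>\n</root>'
)
remove_blanks_preserving(doc)
doc.normalize()  # drops the text nodes that were stripped down to ''
```

After this runs, the text inside <a> has been stripped while the text inside <pre> still carries its surrounding spaces.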
Whitespace is significant within an XML document. Using whitespace for indentation is a poor use of XML, as it introduces significant data where there really is none -- and sadly, this is the norm. Any programmatic approach you take to stripping out whitespace will be, at best, a guess - you need better knowledge of what the XML is conveying to properly remove whitespace, without stepping on some piece of data's toes.
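To make that concrete, here is a small sketch using the standard library's xml.etree.ElementTree (chosen only for illustration) with two made-up documents. The parser reports the indentation as text, so an indented document and a compact one are not the same data as far as the parser can tell.

```python
import xml.etree.ElementTree as ET

# The same two items, once with indentation and once without.
indented = ET.fromstring("<root>\n  <item>a</item>\n  <item>b</item>\n</root>")
compact = ET.fromstring("<root><item>a</item><item>b</item></root>")

# The indentation shows up as .text / .tail on the parsed elements.
print(repr(indented.text))     # whitespace before the first child
print(repr(indented[0].tail))  # whitespace after the first child
print(repr(compact.text))      # nothing there in the compact form
```

Whether that whitespace is meaningful or mere formatting is something only the document's author (or a schema) can say, which is why any automatic stripping is a guess.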
The only thing that bothers me about xml.dom.minidom's toprettyxml() is that it adds blank lines. I don't seem to get the split components, so I just wrote a simple function to remove the blank lines:
#!/usr/bin/env python
import xml.dom.minidom

# toprettyxml() without the blank lines
def prettyPrint(x):
    for line in x.toprettyxml().split('\n'):
        if line.strip():
            print(line)

xml_string = "<monty>\n<example>something</example>\n<python>parrot</python>\n</monty>"

# parse XML
x = xml.dom.minidom.parseString(xml_string)

# clean
prettyPrint(x)
And this is what the code outputs:
<?xml version="1.0" ?>
<monty>
<example>something</example>
<python>parrot</python>
</monty>
If I use toprettyxml() by itself, i.e. print(x.toprettyxml()), it adds unnecessary blank lines:
<?xml version="1.0" ?>
<monty>
<example>something</example>
<python>parrot</python>
</monty>
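If you need the cleaned-up text as a string (for writing to a file, say) rather than printed, the same filtering can be wrapped in a function that returns it. This is only a sketch of that variation; the name pretty_xml is made up for illustration.

```python
import xml.dom.minidom


def pretty_xml(xml_string):
    """Return toprettyxml() output with whitespace-only lines dropped."""
    doc = xml.dom.minidom.parseString(xml_string)
    pretty = doc.toprettyxml()
    # Keep only lines that contain something other than whitespace.
    return "\n".join(line for line in pretty.split("\n") if line.strip())


xml_string = "<monty>\n<example>something</example>\n<python>parrot</python>\n</monty>"
cleaned = pretty_xml(xml_string)
print(cleaned)
```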