Python script to remove all comments from XML file

I am trying to build a Python script that takes an XML document and removes all of the comment blocks from it.

I tried something along the lines of:

tree = ElementTree()
tree.parse(file)
commentElements = tree.findall('//comment()')

for element in commentElements:
    element.parentNode.remove(element)

Doing this yields a weird error from Python: "KeyError: '()'"

I know there are easy ways to edit the file using other tools (like sed), but I have to do it in a Python script.

Jennifer Greentree asked May 03 '12 17:05



4 Answers

comment() is an XPath node test that is not supported by ElementTree.
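
As a side note, here is a minimal sketch using only the standard library. It relies on the fact that ElementTree's default parser does not keep comment nodes in the tree at all, so a plain parse-and-serialise round trip already produces comment-free output. The variable names are illustrative, and the caveat is that ElementTree may also rewrite other details on output (for example namespace prefixes), so treat this as an option to evaluate rather than a drop-in answer.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
</root>"""

# The default parser does not add comment nodes to the tree
# (on Python 3.8+ keeping them requires TreeBuilder(insert_comments=True)).
root = ET.fromstring(XML)

# Serialising the tree therefore yields a document without comments.
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```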

You can use comment() with lxml. This library is quite similar to ElementTree and it has full support for XPath 1.0.

Here is how you can remove comments with lxml:

from lxml import etree

XML = """<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
</root>"""

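# Parse the document; fromstring() returns the root element
tree = etree.fromstring(XML)

# Select every comment node in the document
comments = tree.xpath('//comment()')

for c in comments:
    p = c.getparent()
    # Comments outside the root element have no parent; skip them
    if p is not None:
        p.remove(c)

print(etree.tostring(tree).decode())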

Output:

<root>
  <x>TEXT 1</x>
  <y>TEXT 2 </y>
</root>
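
One thing to watch with this approach: in lxml, text that follows a node is stored as that node's tail, and remove() takes the tail with it. In the sample document the comments are followed only by whitespace, so nothing visible is lost. The sketch below shows one way to keep such text by moving the tail onto the previous sibling or the parent before removing the comment. The helper name remove_comments_keep_tail is made up for illustration and is not part of lxml.

```python
from lxml import etree

def remove_comments_keep_tail(root):
    """Remove comment nodes, keeping any text that follows them."""
    for c in root.xpath('//comment()'):
        parent = c.getparent()
        if parent is None:
            continue  # comment outside the root element
        if c.tail:
            prev = c.getprevious()
            if prev is not None:
                # Attach the text to the preceding sibling's tail
                prev.tail = (prev.tail or '') + c.tail
            else:
                # No preceding sibling: the text belongs to the parent
                parent.text = (parent.text or '') + c.tail
        parent.remove(c)
    return root

root = etree.fromstring('<y>TEXT 2 <!-- COMMENT 2 --> MORE TEXT</y>')
remove_comments_keep_tail(root)
result = etree.tostring(root).decode()
print(result)
```

With a plain p.remove(c), the " MORE TEXT" part of this input would be dropped together with the comment.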
mzjn answered Nov 08 '22 09:11



Use strip_tags() from lxml.etree

from lxml import etree
XML = """<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
  </root>"""

tree = etree.fromstring(XML)
print(etree.tostring(tree).decode())
# strip_tags() removes the comment nodes but keeps the text that follows them
etree.strip_tags(tree, etree.Comment)
print(etree.tostring(tree).decode())

Output:

<root>
<!-- COMMENT 1 -->
<x>TEXT 1</x>
<y>TEXT 2 <!-- COMMENT 2 --></y>
</root>
<root>

<x>TEXT 1</x>
<y>TEXT 2 </y>
</root>
ctjctj2 answered Nov 08 '22 09:11



This is the same approach as in https://stackoverflow.com/a/3317008/1458574: tell the parser to discard comments while it reads the file.

from lxml import etree
import sys

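# Read the file as bytes so the parser can honour any encoding declaration
with open(sys.argv[1], 'rb') as f:
    XML = f.read()

# remove_comments=True makes the parser drop comments while parsing
parser = etree.XMLParser(remove_comments=True)
tree = etree.fromstring(XML, parser=parser)
print(etree.tostring(tree).decode())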
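
Since the question is about cleaning a file, the same parser option can be sketched as a file-to-file helper. This is only an illustration: the function name strip_comments_file, the temporary paths, and the choice of UTF-8 output with an XML declaration are assumptions, not something the answer specifies.

```python
import os
import tempfile
from lxml import etree

def strip_comments_file(src_path, dst_path):
    """Parse src_path without comments and write the result to dst_path."""
    parser = etree.XMLParser(remove_comments=True)
    tree = etree.parse(src_path, parser)
    tree.write(dst_path, encoding='utf-8', xml_declaration=True)

# Demo on temporary files (paths are illustrative only)
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'in.xml')
dst = os.path.join(tmpdir, 'out.xml')
with open(src, 'w', encoding='utf-8') as f:
    f.write('<root><!-- COMMENT 1 --><x>TEXT 1</x></root>')

strip_comments_file(src, dst)
with open(dst, encoding='utf-8') as f:
    output = f.read()
print(output)
```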
user1458574 answered Nov 08 '22 09:11



This is the solution I implemented using minidom:

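def removeCommentNodes(self):
    # self.dom is the parsed minidom document
    for tag in self.dom.getElementsByTagName("*"):
        # Iterate over a copy: removeChild() mutates childNodes
        for n in list(tag.childNodes):
            # Compare node types with ==, not identity
            if n.nodeType == n.COMMENT_NODE:
                n.parentNode.removeChild(n)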

In practice, I first retrieve all the tags in the XML, then look through each tag's child nodes for comments and remove any I find. (self.dom is a reference to the parsed XML.)
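
For anyone who wants to try this outside a class, here is a hedged, self-contained variant of the same idea. It is a sketch, not the method above: the function name remove_comment_nodes is made up, and it walks the tree recursively from the document node instead of calling getElementsByTagName, which also catches comments that sit outside the root element.

```python
from xml.dom import minidom, Node

def remove_comment_nodes(node):
    """Recursively remove every comment node below node."""
    # Copy the child list first: removeChild() mutates childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.COMMENT_NODE:
            node.removeChild(child)
        else:
            remove_comment_nodes(child)

doc = minidom.parseString("""<root>
  <!-- COMMENT 1 -->
  <x>TEXT 1</x>
  <y>TEXT 2 <!-- COMMENT 2 --></y>
</root>""")

remove_comment_nodes(doc)
result = doc.documentElement.toxml()
print(result)
```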

daveoncode answered Nov 08 '22 09:11
