
Python lxml - How to remove empty repeated tags

Tags: python, xml, lxml

I have some XML that is generated by a script that may or may not have empty elements. I was told that now we cannot have empty elements in the XML. Here is an example:

<customer>
    <govId>
        <id>@</id>
        <idType>SSN</idType>
        <issueDate/>
        <expireDate/>
        <dob/>
        <state/>
        <county/>
        <country/>
    </govId>
    <govId>
        <id/>
        <idType/>
        <issueDate/>
        <expireDate/>
        <dob/>
        <state/>
        <county/>
        <country/>
    </govId>
</customer>

The output should look like this:

<customer>
    <govId>
        <id>@</id>
        <idType>SSN</idType>
    </govId>
</customer>

I need to remove all the empty elements. You'll note that my code took out the empty elements in the first "govId" sub-element, but didn't take out anything in the second. I am using lxml.objectify at the moment.

Here is basically what I am doing:

root = objectify.fromstring(xml)
for customer in root.customers.iterchildren():
    for e in customer.govId.iterchildren():
        if not e.text:
            customer.govId.remove(e)

Does anyone know of a way to do this with lxml objectify, or is there an easier way altogether? I would also like to remove the second "govId" element in its entirety if all its elements are empty.

Mike Driscoll asked Oct 02 '12 16:10



1 Answer

First of all, the problem with your code is that you are iterating over customers, but not over govIds. On the third line you take only the first govId of each customer and iterate over its children. So you'd need another for loop for the code to work the way you intended.
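A minimal sketch of that extra loop, assuming plain lxml.etree instead of objectify and a shortened version of the sample XML (both are assumptions made for illustration, not the asker's actual setup). It only handles one level of nesting, so the emptied second govId itself is still left in place:

```python
from lxml import etree

# Shortened stand-in for the XML in the question (assumption).
xml = """<customer>
  <govId><id>@</id><idType>SSN</idType><issueDate/><dob/></govId>
  <govId><id/><idType/><issueDate/><dob/></govId>
</customer>"""

root = etree.fromstring(xml)

# Outer loop: visit every govId, not only the first one.
for gov_id in root.findall("govId"):
    # Copy the children into a list so removing one does not
    # interfere with the iteration.
    for child in list(gov_id):
        if not child.text:
            gov_id.remove(child)

remaining = [[child.tag for child in gov_id] for gov_id in root.findall("govId")]
print(remaining)  # [['id', 'idType'], []]
```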

This small sentence at the end of your question then makes the problem quite a bit more complex: I would also like to remove the second "govId" element in its entirety if all its elements are empty.

This means that, unless you want to hard-code a check for just one level of nesting, you need to recursively check whether an element and its children are empty. For example:

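def recursively_empty(e):
    # An element with text of its own is not empty
    if e.text:
        return False
    # Otherwise it is empty only if all of its children are
    # (all() over no children is True, so empty leaves count as empty)
    return all(recursively_empty(c) for c in e.iterchildren())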

Note: this requires Python 2.5+ because it uses the all() builtin.
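To see what the function returns, here is a small check against a cut-down version of the sample. The XML string is an assumption for illustration and is deliberately written without whitespace between tags, so that container elements have no text of their own:

```python
from lxml import etree

def recursively_empty(e):
    if e.text:
        return False
    return all(recursively_empty(c) for c in e.iterchildren())

# Compact stand-in for the XML in the question (assumption).
xml = (
    "<customer>"
    "<govId><id>@</id><idType>SSN</idType><issueDate/></govId>"
    "<govId><id/><idType/><issueDate/></govId>"
    "</customer>"
)
root = etree.fromstring(xml)
first, second = root.findall("govId")

first_empty = recursively_empty(first)                   # False: <id> has text
second_empty = recursively_empty(second)                 # True: nothing has text
leaf_empty = recursively_empty(first.find("issueDate"))  # True: empty leaf
print(first_empty, second_empty, leaf_empty)
```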

You can then change your code to something like this, which removes every element in the document that is empty all the way down.

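from lxml import etree

# Walk over all elements in the tree and remove all
# nodes that are recursively empty
context = etree.iterwalk(root)
for action, elem in context:
    parent = elem.getparent()
    # The root element has no parent, so it cannot be removed this way
    if parent is not None and recursively_empty(elem):
        parent.remove(elem)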

Sample output:

<customer>
  <govId>
    <id>@</id>
    <idType>SSN</idType>
  </govId>
</customer>

One thing you might want to do is refine the condition if e.text: in the recursive function. Currently this will consider None and the empty string as empty, but not whitespace like spaces and newlines. Use str.strip() if that's part of your definition of "empty".
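As a rough sketch of that refinement, the variant below treats whitespace-only text as empty. This matters when the tree is parsed with plain etree.fromstring from indented XML, because the container elements then carry the indentation as text. The shortened input is an assumption for illustration:

```python
from lxml import etree

def recursively_empty(e):
    # None, "" and whitespace-only text all count as empty here.
    if e.text and e.text.strip():
        return False
    return all(recursively_empty(c) for c in e.iterchildren())

# Indented stand-in for the XML in the question (assumption), so that
# <customer> and <govId> have whitespace-only text.
xml = """<customer>
    <govId>
        <id>@</id>
        <idType>SSN</idType>
        <issueDate/>
    </govId>
    <govId>
        <id/>
        <idType/>
    </govId>
</customer>"""

root = etree.fromstring(xml)
for action, elem in etree.iterwalk(root):
    parent = elem.getparent()
    # Skip the root element, which has no parent to remove it from.
    if parent is not None and recursively_empty(elem):
        parent.remove(elem)

top_level = [child.tag for child in root]
kept = [child.tag for child in root.find("govId")]
print(top_level, kept)  # ['govId'] ['id', 'idType']
```

With the original "if e.text:" test, the second govId in this input would be kept, because its indentation counts as text.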


Edit: As pointed out by @Dave, the recursive function could be improved by using a generator expression:

return all((recursively_empty(c) for c in e.getchildren()))

A generator expression does not evaluate recursively_empty(c) for all the children up front; it evaluates them one at a time, on demand. Since all() stops iterating at the first False result, this can save a lot of work on large trees.
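The difference is easy to observe in plain Python. In this small illustration, check() is a hypothetical stand-in for recursively_empty() that records each call:

```python
calls = []

def check(n):
    # Hypothetical stand-in for recursively_empty(): records every call.
    calls.append(n)
    return n < 2  # False from n == 2 onwards

# List comprehension: the whole list is built before all() sees it.
all([check(n) for n in range(5)])
eager_calls = len(calls)

calls.clear()

# Generator expression: all() stops pulling items at the first False.
all(check(n) for n in range(5))
lazy_calls = len(calls)

print(eager_calls, lazy_calls)  # 5 3
```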

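Edit 2: The expression can be further optimized by using e.iterchildren() instead of e.getchildren(), which builds a full list of the children first and is deprecated in lxml. This works with both the lxml etree API and the objectify API. The function shown earlier already includes both changes.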

Lukas Graf answered Sep 30 '22 16:09
