
How to remove elements from XML using Python

Tags: python, xml

I'm stuck on a task with XML and Python. It seems simple, but I've spent a long time on it without solving it. I'm hoping for advice on how to do it in a couple of lines.

Thanks for any help with traversing the tree; I always end up with too many or too few elements. Elements can be nested to any depth, and the example below is only an illustration. I will accept any solution and am not picky about dom, minidom, sax, or anything else.

I have an XML file similar to this one:

<root>
    <elm>
        <elm>Common content</elm>

        <elm xmlns="http://example.org/ns">
            <elm lang="en">Content EN</elm>
            <elm lang="cs">žluťoučký koníček</elm>
        </elm>

        <elm xml:id="abc123">Common content</elm>

        <elm lang="en">Content EN</elm>
        <elm lang="cs">Content CS</elm>

        <elm lang="en">
            <elm>Content EN</elm>
            <elm>Content EN</elm>
        </elm>

        <elm lang="cs">
            <elm>Content CS</elm>
            <elm>Content CS</elm>
        </elm>
    </elm>
</root>

What I need is to parse the XML and write a new file. The new file should contain all elements for a given language plus all elements without a lang attribute.

For the "cs" language, the output file should contain this:

<root>
    <elm>
        <elm>Common content</elm>

        <elm xmlns="http://example.org/ns">
            <elm lang="cs">žluťoučký koníček</elm>
        </elm>

        <elm xml:id="abc123">Common content</elm>

        <elm lang="cs">Content CS</elm>

        <elm lang="cs">
            <elm>Content CS</elm>
            <elm>Content CS</elm>
        </elm>
    </elm>
</root>

If the lang attribute can be omitted from the new file, even better, but that's not essential.

UPDATE1: Added unicode characters and namespace attribute.

UPDATE2: Using Python 2.5, standard libraries preferred.

asked Aug 29 '10 01:08 by dwich




2 Answers

This updates @Alex Martelli's code to fix a bug: the list of child elements was modified in place while it was being iterated. That solution gives wrong results when the input is a little more complex.

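import sys
from xml.etree import ElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element[:]:  # loop over a copy of the children
            la = subelement.get('lang')
            # remove children that are tagged with a different language
            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    # encoding='unicode' makes write() emit text, which sys.stdout expects on Python 3
    tr.write(sys.stdout, encoding='unicode')
    print()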

On line 7, "for subelement in element:" is changed to "for subelement in element[:]:", because removing items from a list while iterating over it causes the loop to skip elements.
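To make the skipped-element problem concrete, here is a small sketch that runs both loop variants on an in-memory document. It is not part of the original code: the sample XML and the helper name prune are made up for illustration, and it assumes Python 3.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: two adjacent children that should both be removed.
SAMPLE = '<root><a lang="en"/><b lang="en"/><c lang="cs"/></root>'

def prune(root, lang, copy_children):
    """Remove children whose lang differs from `lang`; return the remaining tags."""
    for element in root.iter():
        # Either loop over a copy of the children or over the live element.
        children = element[:] if copy_children else element
        for sub in children:
            la = sub.get('lang')
            if la is not None and la != lang:
                element.remove(sub)
    return [child.tag for child in root]

buggy = prune(ET.fromstring(SAMPLE), 'cs', copy_children=False)
fixed = prune(ET.fromstring(SAMPLE), 'cs', copy_children=True)
print(buggy)  # ['b', 'c']  (removing <a> shifted <b> past the loop position)
print(fixed)  # ['c']
```

The picklang function above uses the copying variant.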

picklang iterates over a copy of each element's child list and removes from the original list every child whose lang attribute is set and differs from the requested language ("en" by default).
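The question asks for the "cs" language and, optionally, for the lang attribute to be dropped from the output. A minimal sketch of that variant follows, assuming Python 3 and the standard library only. The function name filter_lang and the shortened sample document are hypothetical, chosen for illustration.

```python
from xml.etree import ElementTree as ET

def filter_lang(root, lang):
    """Drop subtrees marked with another language; strip lang from what remains."""
    for element in root.iter():
        for sub in list(element):         # copy, because we remove while looping
            la = sub.get('lang')
            if la is not None and la != lang:
                element.remove(sub)
        # Parents are visited before their children, so the attribute has
        # already been checked by the time it is removed here.
        element.attrib.pop('lang', None)
    return root

# Hypothetical, shortened input modeled on the question's example.
SOURCE = '''<root>
    <elm>Common content</elm>
    <elm xmlns="http://example.org/ns">
        <elm lang="en">Content EN</elm>
        <elm lang="cs">žluťoučký koníček</elm>
    </elm>
    <elm lang="en"><elm>Content EN</elm></elm>
    <elm lang="cs"><elm>Content CS</elm></elm>
</root>'''

root = filter_lang(ET.fromstring(SOURCE), 'cs')
result = ET.tostring(root, encoding='unicode')
print(result)
```

To write a file instead of printing, ET.ElementTree(root).write('out.xml', encoding='utf-8', xml_declaration=True) should preserve the non-ASCII characters. Note that ElementTree serializes the namespaced elements with a generated prefix such as ns0 rather than the original default-namespace form; the resulting XML is equivalent but does not look identical.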

answered Oct 19 '22 19:10 by bhuvi


Using lxml:

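import lxml.etree as le

# Open in binary mode so lxml detects the encoding itself
with open('doc.xml', 'rb') as f:
    doc = le.parse(f)

# xpath() returns a plain list, so elements can be removed inside the loop
for elem in doc.xpath('//*[attribute::lang]'):
    if elem.attrib['lang'] == 'en':
        # right language: keep the element, drop the attribute
        elem.attrib.pop('lang')
    else:
        # other language: detach the element from its parent
        parent = elem.getparent()
        parent.remove(elem)
print(le.tostring(doc, encoding='unicode'))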

yields

<root>
    <elm>Common content</elm>

    <elm>
        <elm>Content EN</elm>
        </elm>

    <elm>Common content</elm>

    <elm>Content EN</elm>
    <elm>
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>

    </root>

answered Oct 19 '22 17:10 by unutbu