Efficient way of XML parsing in ElementTree(1.3.0) Python

Tags: performance, python, parsing, xml, lxml

I am trying to parse huge XML files (ranging from 20MB to 3GB). The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).

Below is a small part of a sample file. The namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">

    <instrumentConfiguration id="QTOF">
                    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
                    <componentList count="4">
                            <source order="1">
                                    <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
                            </source>
                            <analyzer order="2">
                                    <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
                            </analyzer>
                            <analyzer order="3">
                                    <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
                            </analyzer>
                            <detector order="4">
                                    <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
                            </detector>
                    </componentList>
     </instrumentConfiguration>

A small but complete file is here.

What I have done so far is use findall for every element of interest.

import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
# Find every instrumentConfiguration element anywhere in the tree
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrument
    print(ins.attrib["id"])

How can I access all children/grandchildren of each instrumentConfiguration element in s?

s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')

Example of what I want

InstrumentConfiguration
-----------------------
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector

Is there an efficient way of parsing element/subelement/subelement when a namespace exists? Or do I have to use find/findall every time to access a particular element in a namespaced tree? This is just a small example; I have to parse a more complex element hierarchy.

Any suggestions?

Edit

I didn't get a correct answer, so I have edited the question once more.

asked Sep 25 '11 10:09

thchand

2 Answers

Here's a script that parses one million <instrumentConfiguration/> elements (a 967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.

The throughput is 24MB/s. The cElementTree page (2005) reports 47MB/s.

#!/usr/bin/env python
# Python 2 script: itertools.imap/izip were removed in Python 3.
from itertools import imap, islice, izip
from operator  import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    # iterparse yields (event, elem) pairs; keep only the elements
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it) # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))] # cvParam
            componentList_count = int(next(it).get('count'))
            # consume (component, cvParam) pairs, e.g. <source/> then its <cvParam/>
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2] # tag name without namespace
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear() # preserve memory

def print_values(it):
    # flatten the (key, value) pairs of every configuration into "key: value" lines
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

filename = 'plgs_example.mzML' # assumed input path; adjust to your file
print_values(parsexml(filename))

Output

$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps

Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
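A less fragile variant is possible at some cost in speed. The following is only a sketch, not a benchmarked replacement: it assumes Python 3's xml.etree.ElementTree, listens for 'end' events so that each <instrumentConfiguration/> subtree is complete when it is inspected, and does not depend on the order of the children. The names local_name, iter_instrument_configs and SAMPLE are hypothetical and are not part of the script above.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://psi.hupo.org/ms/mzml}'

# Trimmed copy of the sample from the question, used only for the demo below.
SAMPLE = b'''<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1"><cvParam name="nanoelectrospray"/></source>
      <analyzer order="2"><cvParam name="quadrupole"/></analyzer>
      <analyzer order="3"><cvParam name="time-of-flight"/></analyzer>
      <detector order="4"><cvParam name="microchannel plate detector"/></detector>
    </componentList>
  </instrumentConfiguration>
</mzML>'''


def local_name(tag):
    """Strip the '{namespace}' prefix from a tag name."""
    return tag.partition('}')[2] if tag.startswith('{') else tag


def iter_instrument_configs(source):
    """Yield one list of (key, value) pairs per <instrumentConfiguration/>."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first 'start' event carries the root element
    for event, elem in context:
        if event != 'end' or elem.tag != NS + 'instrumentConfiguration':
            continue
        values = [('Id', elem.get('id'))]
        param_no = 0
        for child in elem:
            if child.tag == NS + 'cvParam':
                # cvParam placed directly under instrumentConfiguration
                param_no += 1
                values.append(('Parameter%d' % param_no, child.get('name')))
            elif child.tag == NS + 'componentList':
                # each component (source, analyzer, detector) may hold cvParams
                for component in child:
                    for param in component.iter(NS + 'cvParam'):
                        values.append((local_name(component.tag),
                                       param.get('name')))
        yield values
        root.clear()  # detach processed subtrees from the root


if __name__ == '__main__':
    for values in iter_instrument_configs(io.BytesIO(SAMPLE)):
        for key, value in values:
            print('%s: %s' % (key, value))
```

Waiting for 'end' events means a whole <instrumentConfiguration/> subtree is held in memory at once, which should be acceptable for small elements like this one but is worth measuring on real files.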

On performance

ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.

If you replace root.clear() with elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as cElementTree's, but it consumes 20 times as much memory as the cElementTree root.clear() variant, or 2 times as much as the elem.clear() variant (500MB).
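One likely reason for the memory difference is that elem.clear() empties an element but leaves it attached to its parent, whereas root.clear() drops the processed children altogether. The sketch below illustrates that behaviour on a small synthetic document. It assumes Python 3's xml.etree.ElementTree; make_document and parse_and_count are hypothetical helpers written for this illustration, and the sketch counts retained elements without measuring actual memory use.

```python
import io
import xml.etree.ElementTree as ET


def make_document(n):
    """Build a synthetic document with n <item/> children (made-up data)."""
    items = ''.join('<item id="%d"><sub name="x"/></item>' % i
                    for i in range(n))
    return io.BytesIO(('<root>%s</root>' % items).encode('ascii'))


def parse_and_count(source, clear_root):
    """Parse source, clearing after every <item/>.

    Returns (items seen, children still attached to the root at the end).
    """
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first 'start' event carries the root element
    seen = 0
    for event, elem in context:
        if event == 'end' and elem.tag == 'item':
            seen += 1
            if clear_root:
                root.clear()  # removes the processed children from the root
            else:
                elem.clear()  # empties the item but keeps it under the root
    return seen, len(root)


if __name__ == '__main__':
    print(parse_and_count(make_document(1000), clear_root=True))
    print(parse_and_count(make_document(1000), clear_root=False))
```

With elem.clear() the root still references one empty element per processed item, so the retained count grows with the size of the file.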

answered Sep 28 '22 14:09

jfs

If this is still a current issue, you might try pymzML, a Python interface to mzML files. Website: http://pymzml.github.com/

answered Sep 28 '22 15:09

JBa
