Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing XML in Python using ElementTree example

I'm having a hard time finding a good, basic example of how to parse XML in python using Element Tree. From what I can find, this appears to be the easiest library to use for parsing XML. Here is a sample of the XML I'm working with:

<timeSeriesResponse>
    <queryInfo>
        <locationParam>01474500</locationParam>
        <variableParam>99988</variableParam>
        <timeParam>
            <beginDateTime>2009-09-24T15:15:55.271</beginDateTime>
            <endDateTime>2009-11-23T15:15:55.271</endDateTime>
        </timeParam>
     </queryInfo>
     <timeSeries name="NWIS Time Series Instantaneous Values">
         <values count="2876">
            <value dateTime="2009-09-24T15:30:00.000-04:00" qualifiers="P">550</value>
            <value dateTime="2009-09-24T16:00:00.000-04:00" qualifiers="P">419</value>
            <value dateTime="2009-09-24T16:30:00.000-04:00" qualifiers="P">370</value>
            .....
         </values>
     </timeSeries>
</timeSeriesResponse>

I am able to do what I need, using a hard-coded method. But I need my code to be a bit more dynamic. Here is what worked:

tree = ET.parse(sample.xml)
doc = tree.getroot()

timeseries =  doc[1]
values = timeseries[2]

print child.attrib['dateTime'], child.text
#prints 2009-09-24T15:30:00.000-04:00, 550

Here are a couple of things I've tried, none of them worked, reporting that they couldn't find timeSeries (or anything else I tried):

tree = ET.parse(sample.xml)
tree.find('timeSeries')

tree = ET.parse(sample.xml)
doc = tree.getroot()
doc.find('timeSeries')

Basically, I want to load the xml file, search for the timeSeries tag, and iterate through the value tags, returning the dateTime and the value of the tag itself; everything I'm doing in the above example, but not hard coding the sections of xml I'm interested in. Can anyone point me to some examples, or give me some suggestions on how to work through this?


Thanks for all the help. Using both of the below suggestions worked on the sample file I provided, however, they didn't work on the full file. Here is the error I get from the real file when I use Ed Carrel's method:

 (<type 'exceptions.AttributeError'>, AttributeError("'NoneType' object has no attribute 'attrib'",), <traceback object at 0x011EFB70>)

I figured there was something in the real file it didn't like, so I incremently removed things until it worked. Here are the lines that I changed:

originally: <timeSeriesResponse xsi:schemaLocation="a URL I removed" xmlns="a URL I removed" xmlns:xsi="a URL I removed">
 changed to: <timeSeriesResponse>

 originally:  <sourceInfo xsi:type="SiteInfoType">
 changed to: <sourceInfo>

 originally: <geogLocation xsi:type="LatLonPointType" srs="EPSG:4326">
 changed to: <geogLocation>

Removing the attributes that have 'xsi:...' fixed the problem. Is the 'xsi:...' not valid XML? It will be hard for me to remove these programmatically. Any suggested work arounds?

Here is the full XML file: http://www.sendspace.com/file/lofcpt


When I originally asked this question, I was unaware of namespaces in XML. Now that I know what's going on, I don't have to remove the "xsi" attributes, which are the namespace declarations. I just include them in my xpath searches. See this page for more info on namespaces in lxml.

like image 524
Casey Avatar asked Nov 23 '09 22:11

Casey


People also ask

How do I parse XML in Python?

There are two ways to parse the file using 'ElementTree' module. The first is by using the parse() function and the second is fromstring() function. The parse () function parses XML document which is supplied as a file whereas, fromstring parses XML when supplied as a string i.e within triple quotes.

What is XML Etree ElementTree in Python?

The xml. etree. ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.

How do I iterate over an XML tag in Python?

To iterate over all nodes, use the iter method on the ElementTree , not the root Element. The root is an Element, just like the other elements in the tree and only really has context of its own attributes and children. The ElementTree has the context for all Elements.


2 Answers

So I have ElementTree 1.2.6 on my box now, and ran the following code against the XML chunk you posted:

import elementtree.ElementTree as ET

tree = ET.parse("test.xml")
doc = tree.getroot()
thingy = doc.find('timeSeries')

print thingy.attrib

and got the following back:

{'name': 'NWIS Time Series Instantaneous Values'}

It appears to have found the timeSeries element without needing to use numerical indices.

What would be useful now is knowing what you mean when you say "it doesn't work." Since it works for me given the same input, it is unlikely that ElementTree is broken in some obvious way. Update your question with any error messages, backtraces, or anything you can provide to help us help you.

like image 94
Ed Carrel Avatar answered Oct 10 '22 06:10

Ed Carrel


If I understand your question correctly:

for elem in doc.findall('timeSeries/values/value'):
    print elem.get('dateTime'), elem.text

or if you prefer (and if there is only one occurrence of timeSeries/values:

values = doc.find('timeSeries/values')
for value in values:
    print value.get('dateTime'), elem.text

The findall() method returns a list of all matching elements, whereas find() returns only the first matching element. The first example loops over all the found elements, the second loops over the child elements of the values element, in this case leading to the same result.

I don't see where the problem with not finding timeSeries comes from however. Maybe you just forgot the getroot() call? (note that you don't really need it because you can work from the elementtree itself too, if you change the path expression to for example /timeSeriesResponse/timeSeries/values or //timeSeries/values)

like image 22
Steven Avatar answered Oct 10 '22 07:10

Steven