I am new to xml parsing. This xml file has the following tree:
FHRSEstablishment
|--> Header
| |--> ...
|--> EstablishmentCollection
| |--> EstablishmentDetail
| | |-->...
| |--> Scores
| | |-->...
|--> EstablishmentCollection
| |--> EstablishmentDetail
| | |-->...
| |--> Scores
| | |-->...
but when I access it with ElementTree and look for the child tags and attributes,
import xml.etree.ElementTree as ET
import urllib2
tree = ET.parse(
    urllib2.urlopen('http://ratings.food.gov.uk/OpenDataFiles/FHRS408en-GB.xml'))
root = tree.getroot()
for child in root:
    print child.tag, child.attrib
I only get:
Header {}
EstablishmentCollection {}
which I assume means that their attributes are empty. Why is it so, and how can I access the children nested inside EstablishmentDetail and Scores?
EDIT
Thanks to the answers below I can get inside the tree, but if I want to retrieve values such as those in Scores, this fails:
for node in root.find('.//EstablishmentDetail/Scores'):
    rating = node.attrib.get('Hygiene')
    print rating
and produces
None
None
None
Why is that?
The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.
To read an XML file using ElementTree, first import the xml.etree.ElementTree module under the name ET (a common convention). Then pass the filename of the XML file to ET.parse() to parse it, and get the root (the top-level tag) of the document with getroot().
Parsing from strings and files: lxml.etree supports parsing XML in a number of ways and from all important sources, namely strings, files, URLs (http/ftp) and file-like objects. The main parse functions are fromstring() and parse(), both called with the source as first argument.
You have to call iter() on your root. Looping over root directly only yields its immediate children, whereas root.iter() walks the whole tree, so that would do the trick!
import xml.etree.ElementTree as ET
import urllib2
tree = ET.parse(urllib2.urlopen('http://ratings.food.gov.uk/OpenDataFiles/FHRS408en-GB.xml'))
root = tree.getroot()
for child in root.iter():
    print child.tag, child.attrib
Output:
FHRSEstablishment {}
Header {}
ExtractDate {}
ItemCount {}
ReturnCode {}
EstablishmentCollection {}
EstablishmentDetail {}
FHRSID {}
LocalAuthorityBusinessID {}
...
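To make the difference between looping over root and calling root.iter() concrete, here is a minimal sketch. It is written in Python 3 syntax and does not download the real file: SAMPLE is a small, made-up document that only imitates the structure shown above, so the tags it contains are an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that imitates the structure described in the question;
# it is not the real FHRS file.
SAMPLE = """
<FHRSEstablishment>
  <Header><ItemCount>1</ItemCount></Header>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <FHRSID>1</FHRSID>
      <Scores><Hygiene>5</Hygiene></Scores>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>
"""

root = ET.fromstring(SAMPLE.strip())

# Iterating over an element yields only its immediate children.
direct_children = [child.tag for child in root]

# iter() yields the element itself and every descendant, in document order.
all_descendants = [elem.tag for elem in root.iter()]

print(direct_children)
print(all_descendants)
```

The empty {} printed next to each tag is expected here too: in this kind of document the values live in the text of the elements, not in attributes.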
To get the children of EstablishmentDetail, you need to find that tag and then loop through its children. For example:
for child in root.find('.//EstablishmentDetail'):
    print child.tag, child.attrib
Output:
FHRSID {}
LocalAuthorityBusinessID {}
BusinessName {}
BusinessType {}
BusinessTypeID {}
RatingValue {}
RatingKey {}
RatingDate {}
LocalAuthorityCode {}
LocalAuthorityName {}
LocalAuthorityWebSite {}
LocalAuthorityEmailAddress {}
Scores {}
SchemeType {}
NewRatingPending {}
Geocode {}
As for getting Hygiene, which you mentioned in the comments: root.find('.//Scores') returns only the first Scores tag, and its children are the Hygiene, ConfidenceInManagement and Structural tags. So when you run for each in root.find('.//Scores'): rating = each.get('Hygiene'), you are asking each of those three children for an attribute named Hygiene. None of them has such an attribute, which is why you get None three times.
You need to:
- first find all Scores tags;
- then find Hygiene in every tag found and read its text.
for each in root.findall('.//Scores'):
    rating = each.find('.//Hygiene')
    print '' if rating is None else rating.text
Output:
5
5
5
0
5
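If you also want to know which establishment each rating belongs to, one option is to loop over the EstablishmentDetail tags and use findtext() with a relative path. This is only a sketch in Python 3 syntax: SAMPLE is a made-up document, the business names are invented, and hygiene_by_business is a hypothetical helper name, not something from the library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; names and values are invented for illustration.
SAMPLE = """<EstablishmentCollection>
  <EstablishmentDetail>
    <BusinessName>Cafe A</BusinessName>
    <Scores>
      <Hygiene>5</Hygiene>
      <Structural>5</Structural>
      <ConfidenceInManagement>0</ConfidenceInManagement>
    </Scores>
  </EstablishmentDetail>
  <EstablishmentDetail>
    <BusinessName>Cafe B</BusinessName>
    <Scores/>
  </EstablishmentDetail>
</EstablishmentCollection>"""


def hygiene_by_business(root):
    """Map each BusinessName to its Hygiene text (None when it is missing)."""
    results = {}
    for detail in root.findall('.//EstablishmentDetail'):
        name = detail.findtext('BusinessName')
        # findtext returns None when the path matches nothing.
        results[name] = detail.findtext('Scores/Hygiene')
    return results


root = ET.fromstring(SAMPLE)
ratings = hygiene_by_business(root)

# The value is stored as element text, so the attribute dict stays empty.
first_hygiene = root.find('.//Hygiene')
print(first_hygiene.attrib, first_hygiene.text)
print(ratings)
```

Using findtext() avoids the explicit None check, at the cost of not distinguishing a missing Hygiene tag from a missing Scores tag.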
Hope it could be useful:
import xml.etree.ElementTree as etree
with open('filename.xml') as tmpfile:
    # iterparse yields (event, element) pairs while the file is being read
    doc = iter(etree.iterparse(tmpfile, events=("start", "end")))
    # the first "start" event belongs to the root element
    event, root = next(doc)
    for event, elem in doc:
        print event, elem
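For the ratings in the question, the same streaming idea could look roughly like the sketch below (Python 3 syntax). It listens only for "end" events, because an element's children are guaranteed to be complete at that point. SAMPLE and the helper name stream_hygiene are assumptions made for illustration; with a real file you would pass the opened file object instead of the BytesIO.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a large file on disk.
SAMPLE = b"""<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <Scores><Hygiene>5</Hygiene></Scores>
    </EstablishmentDetail>
    <EstablishmentDetail>
      <Scores><Hygiene>3</Hygiene></Scores>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>"""


def stream_hygiene(source):
    """Collect the Hygiene text of every EstablishmentDetail while parsing."""
    ratings = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "EstablishmentDetail":
            ratings.append(elem.findtext("Scores/Hygiene"))
            # Drop the children that were just processed to keep memory low.
            elem.clear()
    return ratings


streamed = stream_hygiene(io.BytesIO(SAMPLE))
print(streamed)
```

For a file of a few hundred entries this buys little over ET.parse(); it mainly helps when the document is too large to hold in memory comfortably.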