The xml. etree. ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.
This type of tree structure is applicable to XML files as well. Therefore, the BeautifulSoup class can also be used to parse XML files directly. The installation of BeautifulSoup has already been discussed at the end of the lesson on Setting up for Python programming.
You need to give the .find()
, findall()
and iterfind()
methods an explicit namespace dictionary:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces
parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl:
part, looks up the corresponding namespace URL in the namespaces
dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class
instead. You can use the same syntax yourself too of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
Also see the Parsing XML with Namespaces section of the ElementTree documentation.
If you can switch to the lxml
library things are better; that library supports the same ElementTree API, but collects namespaces for you in .nsmap
attribute on elements and generally has superior namespaces support.
Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
if not k:
namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract namespace's prefixes and URI from XML data you can use ElementTree.iterparse
function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)
I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included. If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.
root.iter()
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
To get the namespace in its namespace format, e.g. {myNameSpace}
, you can do the following:
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).
link = root.find(f"{ns}link")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With