I'm parsing an XML document that I receive from a vendor every day, and it uses namespaces heavily. I've reduced the problem to a minimal subset here:
There are some elements I need to parse, all of which are children of an element with a specific attribute in it.
I am able to use lxml.etree.Element.findall(TAG, root.nsmap) to find the candidate nodes whose attribute I need to check.
I'm then trying to check the attribute of each of these Elements via the name I know it uses, which concretely here is ss:Name. If the value of that attribute is the desired value, I'm going to dive deeper into said Element (to continue doing other things).
How can I do this?
The XML I'm parsing is roughly
<FOO xmlns="SOME_REALLY_LONG_STRING"
some gorp declaring a bunch of namespaces one of which is
xmlns:ss="THE_VERY_SAME_REALLY_LONG_STRING_AS_ROOT"
>
<child_of_foo>
....
</child_of_foo>
...
<SomethingIWant ss:Name="bar" OTHER_ATTRIBS_I_DONT_CARE_ABOUT>
....
<MoreThingsToLookAtLater>
....
</MoreThingsToLookAtLater>
....
</SomethingIWant>
...
</FOO>
I found the first Element I wanted, SomethingIWant, like so (ultimately I want them all, so I used findall):
from lxml import etree
tree = etree.parse(myfilename)
root = tree.getroot()
# I want just the first one for now
my_sheet = root.findall('ss:SomethingIWant', root.nsmap)[0]
Now I want to get the ss:Name attribute from this element, to check it, but I'm not sure how.
I know that my_sheet.attrib will show me the raw URI followed by the attribute name, but I don't want that. I need to check whether it has a specific value for a specific namespaced attribute (because if it's wrong, I can skip this element from further processing entirely).
I tried using lxml.etree.ElementTree.attrib.get() but I don't seem to obtain anything useful.
Any ideas?
One of the advantages of lxml over the standard Python XML parser is lxml's full support of the XPath 1.0 specification via the xpath() method, so I would go with xpath() most of the time. Working example for your current case:
from lxml import etree
xml = """<FOO xmlns="SOME_REALLY_LONG_STRING"
xmlns:ss="THE_VERY_SAME_REALLY_LONG_STRING_AS_ROOT"
>
<child_of_foo>
....
</child_of_foo>
...
<SomethingIWant ss:Name="bar">
....
</SomethingIWant>
...
</FOO>"""
root = etree.fromstring(xml)
ns = {'ss': 'THE_VERY_SAME_REALLY_LONG_STRING_AS_ROOT'}
# I want just the first one for now
result = root.xpath('//@ss:Name', namespaces=ns)[0]
print(result)
output :
bar
UPDATE:
Modified example demonstrating how to get a namespaced attribute from the current element:
ns = {'ss': 'THE_VERY_SAME_REALLY_LONG_STRING_AS_ROOT', 'd': 'SOME_REALLY_LONG_STRING'}
element = root.xpath('//d:SomethingIWant', namespaces=ns)[0]
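# encoding='unicode' returns str instead of bytes, so print() shows plain markup on Python 3
print(etree.tostring(element, encoding='unicode'))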
attribute = element.xpath('@ss:Name', namespaces=ns)[0]
print(attribute)
output :
<SomethingIWant xmlns="SOME_REALLY_LONG_STRING" xmlns:ss="THE_VERY_SAME_REALLY_LONG_STRING_AS_ROOT" ss:Name="bar">
....
</SomethingIWant>
...
bar
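Since the goal is to keep only the elements whose ss:Name has a particular value, the check can also be pushed into the XPath expression itself with a predicate. A minimal sketch, assuming two stand-in URIs (urn:example:default and urn:example:ss are hypothetical placeholders for the vendor's long namespace strings) and a second SomethingIWant element added only to show the filtering:

```python
from lxml import etree

# Hypothetical stand-ins for the vendor's real namespace URIs.
DEFAULT_NS = "urn:example:default"
SS_NS = "urn:example:ss"

xml = f"""<FOO xmlns="{DEFAULT_NS}" xmlns:ss="{SS_NS}">
  <SomethingIWant ss:Name="bar"><MoreThingsToLookAtLater/></SomethingIWant>
  <SomethingIWant ss:Name="baz"/>
</FOO>"""

root = etree.fromstring(xml)

# XPath has no notion of a default namespace, so the default one
# needs an explicit prefix here ('d' is an arbitrary choice).
ns = {"d": DEFAULT_NS, "ss": SS_NS}

# The predicate keeps only elements whose ss:Name attribute equals "bar".
matches = root.xpath('//d:SomethingIWant[@ss:Name="bar"]', namespaces=ns)

# Read the attribute back using the {uri}local form of its name.
names = [el.get(f"{{{SS_NS}}}Name") for el in matches]
print(len(matches), names)
```

Each item in matches is a regular Element, so you can keep calling xpath() or findall() on it to dive deeper.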
I'm pretty sure this is a horribly non-Pythonic, non-ideal way to do it, and it seems like there must be a better way, but I discovered I could do this:
SS_REAL = "{%s}" % root.nsmap.get('ss')
and then I could do:
my_sheet.get(SS_REAL + "Name")
It gets me what I want, but this can't possibly be the right way to do this.
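For what it's worth, the "{uri}local" form built above is the same notation lxml itself uses for the keys in .attrib, so looking the attribute up that way is not as hacky as it feels. A slightly tidier sketch of the same idea, using etree.QName to build the key instead of string formatting. The URIs and the extra sample elements are hypothetical placeholders, not the vendor's real data:

```python
from lxml import etree

# Hypothetical stand-in document; the URIs are placeholders.
xml = """<FOO xmlns="urn:example:default" xmlns:ss="urn:example:ss">
  <SomethingIWant ss:Name="bar"/>
  <SomethingIWant ss:Name="baz"/>
  <SomethingIWant/>
</FOO>"""
root = etree.fromstring(xml)

# Build the qualified names once from the URIs declared on the root.
# root.nsmap[None] is the default (unprefixed) namespace.
tag_key = etree.QName(root.nsmap[None], "SomethingIWant").text
name_key = etree.QName(root.nsmap["ss"], "Name").text

candidates = list(root.iterchildren(tag_key))

# .get() returns None when the attribute is missing, so the
# comparison below simply skips elements without ss:Name.
values = [el.get(name_key) for el in candidates]
wanted = [el for el in candidates if el.get(name_key) == "bar"]
print(name_key, values, len(wanted))
```

Note that the local name is case-sensitive, so "Name" and "NAME" are different attributes.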