I am using XPath with Python lxml (Python 2). I make two passes over the data: one to select the records of interest, and one to extract values from them. Here is a sample of the kind of code I am using:
from lxml import etree
xml = """
<records>
<row id="1" height="160" weight="80" />
<row id="2" weight="70" />
<row id="3" height="140" />
</records>
"""
parsed = etree.fromstring(xml)
nodes = parsed.xpath('/records/row')
for node in nodes:
    print node.xpath("@id|@height|@weight")
When I run this script the output is:
['1', '160', '80']
['2', '70']
['3', '140']
As you can see from the result, where an attribute is missing, the positions of the other attributes shift, so in rows 2 and 3 I cannot tell whether the second value is the height or the weight.
Is there a way to get the names of the attributes returned from etree/lxml? Ideally, I would like a result in the format:
[('@id', '1'), ('@height', '160'), ('@weight', '80')]
I recognise that I can solve this specific case using ElementTree and Python. However, I wish to resolve this using XPath expressions (relatively simple ones), rather than processing the data in Python.
You could try the following:
for node in nodes:
    print node.attrib
This prints a dict-like view of all the node's attributes, for example {'id': '1', 'weight': '80', 'height': '160'}.
If you want to get something like [('@id', '1'), ('@height', '160'), ('@weight', '80')]:
list_of_attributes = []
for node in nodes:
    attrs = []
    for att in node.attrib:
        attrs.append(("@" + att, node.attrib[att]))
    list_of_attributes.append(attrs)
Output:
[[('@id', '1'), ('@height', '160'), ('@weight', '80')], [('@id', '2'), ('@weight', '70')], [('@id', '3'), ('@height', '140')]]
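Since the question asks to stay with the original XPath expression, another option may be lxml's "smart strings". By default, string results from xpath() remember where they came from, and attribute results expose the attribute name as .attrname. The following is only a minimal sketch: the helper name named_attributes is made up for illustration, and it is written to run on Python 3 as well as Python 2.

```python
from lxml import etree

xml = """
<records>
<row id="1" height="160" weight="80" />
<row id="2" weight="70" />
<row id="3" height="140" />
</records>
"""

parsed = etree.fromstring(xml)


def named_attributes(node, expr="@id|@height|@weight"):
    # Hypothetical helper, not part of lxml.
    # xpath() returns "smart strings" by default; for a result that came
    # from an attribute, .attrname holds that attribute's name.
    return [("@" + value.attrname, str(value)) for value in node.xpath(expr)]


rows = [named_attributes(node) for node in parsed.xpath("/records/row")]
for row in rows:
    print(row)
```

This keeps the XPath expression unchanged and only post-processes the result list, so a missing attribute no longer makes the remaining values ambiguous.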
I was wrong in my assertion that I was not going to use Python. I found that the lxml/etree implementation is easily extended, so that I can use the XPath DSL with modifications.
I registered a function called "dictify" and changed the XPath expression to:
dictify('@id|@height|@weight|weight|height')
The new code is:
from lxml import etree
xml = """
<records>
<row id="1" height="160" weight="80" />
<row id="2" weight="70" ><height>150</height></row>
<row id="3" height="140" />
</records>
"""
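def dictify(context, names):
    # XPath extension function: return name/value entries for the context node.
    node = context.context_node
    rv = []
    rv.append('__dictify_start_marker__')
    names = names.split('|')
    for n in names:
        if n.startswith('@'):
            # Attribute name: emit it only if the attribute is present.
            val = node.attrib.get(n[1:])
            if val is not None:
                rv.append(n)
                rv.append(val)
        else:
            # Element name: emit the text of each matching child element.
            children = node.findall(n)
            for child_node in children:
                rv.append(n)
                rv.append(child_node.text)
    rv.append('__dictify_end_marker__')
    return rv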
etree_functions = etree.FunctionNamespace(None)
etree_functions['dictify'] = dictify
parsed = etree.fromstring(xml)
nodes = parsed.xpath('/records/row')
for node in nodes:
    print node.xpath("dictify('@id|@height|@weight|weight|height')")
This produces the following output:
['__dictify_start_marker__', '@id', '1', '@height', '160', '@weight', '80', '__dictify_end_marker__']
['__dictify_start_marker__', '@id', '2', '@weight', '70', 'height', '150', '__dictify_end_marker__']
['__dictify_start_marker__', '@id', '3', '@height', '140', '__dictify_end_marker__']
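The flat lists above still have to be paired up before they are convenient to use. As a rough sketch, the marker-delimited output could be folded into (name, value) tuples like this. The helper name pairs_from_dictify is hypothetical, and the sketch assumes the list always alternates name and value between the two markers.

```python
START = '__dictify_start_marker__'
END = '__dictify_end_marker__'


def pairs_from_dictify(flat):
    """Turn [START, name, value, ..., END] into [(name, value), ...]."""
    if len(flat) < 2 or flat[0] != START or flat[-1] != END:
        raise ValueError("expected a list wrapped in the dictify markers")
    body = flat[1:-1]
    if len(body) % 2:
        raise ValueError("expected alternating name/value entries")
    # Even positions hold names, odd positions hold the matching values.
    return list(zip(body[0::2], body[1::2]))


sample = ['__dictify_start_marker__', '@id', '2', '@weight', '70',
          'height', '150', '__dictify_end_marker__']
pairs = pairs_from_dictify(sample)
print(pairs)
# [('@id', '2'), ('@weight', '70'), ('height', '150')]
```

Keeping tuples rather than building a dict preserves repeated names, which matters if a row has several child elements with the same tag.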