
Retrieve attribute names and values with Python / lxml and XPath

Tags: python, xpath, lxml

I am using XPath with Python lxml (Python 2). I make two passes over the data: one to select the records of interest, and one to extract values from them. Here is a sample of the kind of code I use.

from lxml import etree

xml = """
  <records>
    <row id="1" height="160" weight="80" />
    <row id="2" weight="70" />
    <row id="3" height="140" />
  </records>
"""

parsed = etree.fromstring(xml)
nodes = parsed.xpath('/records/row')
for node in nodes:
    print node.xpath("@id|@height|@weight")

When I run this script the output is:

['1', '160', '80']
['2', '70']
['3', '140']

As you can see from the result, when an attribute is missing the positions of the other attributes shift, so in rows 2 and 3 I cannot tell whether the second value is the height or the weight.

Is there a way to get the names of the attributes returned from etree/lxml? Ideally, I would get a result in the format:

[('@id', '1'), ('@height', '160'), ('@weight', '80')]

I recognise that I can solve this specific case using ElementTree and Python. However, I want to solve it using XPath (and relatively simple XPath expressions), rather than processing the data in Python.

Kevin Gill, asked Feb 23 '17 10:02




2 Answers

Try the following:

for node in nodes:
    print node.attrib

This prints a dict-like mapping of all the node's attributes, for example {'id': '1', 'weight': '80', 'height': '160'} for the first row.
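
Because the attributes are keyed by name, one way to keep values in fixed positions is to look each name up explicitly. The following is only a sketch: the `fields` tuple and the `table` list are names made up for illustration, and it assumes a missing attribute should show up as None. It is written with print() so that it also runs on Python 3.

```python
from lxml import etree

xml = """
<records>
  <row id="1" height="160" weight="80" />
  <row id="2" weight="70" />
  <row id="3" height="140" />
</records>
"""

parsed = etree.fromstring(xml)

# Hypothetical fixed column order; adjust to the attributes you care about.
fields = ('id', 'height', 'weight')

table = []
for node in parsed.xpath('/records/row'):
    # Element.get(name) returns None when the attribute is absent,
    # so every record has the same length and column positions.
    record = [node.get(name) for name in fields]
    table.append(record)
    print(record)
```

With the sample data, the second row comes out as ['2', None, '70'], so the weight stays in the third position even though the height is missing.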

If you want to get something like [('@id', '1'), ('@height', '160'), ('@weight', '80')]:

list_of_attributes = []
for node in nodes:
    attrs = []
    for att in node.attrib:
        attrs.append(("@" + att, node.attrib[att]))
    list_of_attributes.append(attrs)

list_of_attributes then contains:

[[('@id', '1'), ('@height', '160'), ('@weight', '80')], [('@id', '2'), ('@weight', '70')], [('@id', '3'), ('@height', '140')]]
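
If you would rather keep the original XPath expression, lxml's XPath string results are "smart" strings: a value selected from an attribute carries an attrname property holding the attribute's name. A minimal sketch under that assumption (the `rows` list is just an illustrative name, and print() is used so it also runs on Python 3):

```python
from lxml import etree

xml = """
<records>
  <row id="1" height="160" weight="80" />
  <row id="2" weight="70" />
  <row id="3" height="140" />
</records>
"""

parsed = etree.fromstring(xml)

rows = []
for node in parsed.xpath('/records/row'):
    # Each attribute result is a smart string; .attrname gives the
    # name of the attribute the value was read from.
    pairs = [('@' + value.attrname, str(value))
             for value in node.xpath('@id|@height|@weight')]
    rows.append(pairs)
    print(pairs)
```

This keeps the selection logic in XPath and only uses Python to pair each value with its name.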
Andersson, answered Sep 27 '22 23:09


I was wrong in my assertion that I was not going to use Python. I found that the lxml/etree implementation is easily extended, so that I can use the XPath DSL with modifications.

I registered the function "dictify" and changed the XPath expression to:

dictify('@id|@height|@weight|weight|height')

The new code is:

from lxml import etree

xml = """
<records>
    <row id="1" height="160" weight="80" />
    <row id="2" weight="70" ><height>150</height></row>
    <row id="3" height="140" />
</records>
"""

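def dictify(context, names):
    # XPath extension function: lxml passes the evaluation context first,
    # followed by the arguments given in the XPath expression.
    node = context.context_node
    rv = ['__dictify_start_marker__']
    for n in names.split('|'):
        if n.startswith('@'):
            # Attribute name: emit the name/value pair only if it is present.
            val = node.attrib.get(n[1:])
            if val is not None:
                rv.append(n)
                rv.append(val)
        else:
            # Element name: emit one name/text pair per matching child.
            for child_node in node.findall(n):
                rv.append(n)
                rv.append(child_node.text)
    rv.append('__dictify_end_marker__')
    return rv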

etree_functions = etree.FunctionNamespace(None)
etree_functions['dictify'] = dictify


parsed = etree.fromstring(xml)
nodes = parsed.xpath('/records/row')
for node in nodes:
    print node.xpath("dictify('@id|@height|@weight|weight|height')")

This produces the following output:

['__dictify_start_marker__', '@id', '1', '@height', '160', '@weight', '80', '__dictify_end_marker__']
['__dictify_start_marker__', '@id', '2', '@weight', '70', 'height', '150', '__dictify_end_marker__']
['__dictify_start_marker__', '@id', '3', '@height', '140', '__dictify_end_marker__']
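
The flat lists above still need to be folded back into name/value pairs. One possible way to do that is sketched below in plain Python; fold_pairs is a hypothetical helper name, not part of lxml, and it assumes the list has exactly the start-marker, name, value, ..., end-marker layout shown in the output.

```python
START = '__dictify_start_marker__'
END = '__dictify_end_marker__'


def fold_pairs(flat):
    """Turn [START, name, value, name, value, ..., END] into (name, value) tuples."""
    if not flat or flat[0] != START or flat[-1] != END:
        raise ValueError('missing dictify markers')
    body = flat[1:-1]
    if len(body) % 2:
        raise ValueError('odd number of name/value items')
    # Even positions hold names, odd positions hold the matching values.
    return list(zip(body[0::2], body[1::2]))


# Sample input copied from the second row of the output above.
flat = [START, '@id', '2', '@weight', '70', 'height', '150', END]
pairs = fold_pairs(flat)
print(pairs)
```

A list of tuples is used rather than a dict because an element name such as height can match several children, and a dict would keep only the last one.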
Kevin Gill, answered Sep 28 '22 00:09