
Handling nested elements with Python lxml

Tags: python, xml, lxml

Given the simple XML data below:

<book>
   <title>My First Book</title>
   <abstract>
         <para>First paragraph of the abstract</para>
         <para>Second paragraph of the abstract</para>
    </abstract>
    <keywordSet>
         <keyword>First keyword</keyword>
         <keyword>Second keyword</keyword>
         <keyword>Third keyword</keyword>
    </keywordSet>
</book>

How can I traverse the tree, using lxml, and get all paragraphs in the "abstract" element, as well as all keywords in the "keywordSet" element?

The code snippet below returns only the first line of text in each element:

from lxml import objectify
root = objectify.fromstring(xml_string) # xml_string contains the XML data above
print(root.title)  # prints the book title
for line in root.abstract:
    print(line.para)  # prints only the first paragraph
for word in root.keywordSet:
    print(word.keyword)  # prints only the first keyword in the set

I tried to follow this example, but the code above doesn't work as expected.

On a different tack, it would be even better to read the entire XML tree into a Python dictionary, with each element name as a key and the element's text (or texts) as the value. I found out that something like this might be possible using lxml objectify, but I couldn't figure out how to achieve it.

One really big problem I keep running into when writing XML parsing code in Python is that most of the "examples" provided are too simple and too artificial to be of much help, or else they are just the opposite and use overly complicated, automatically generated XML data!

Could anybody give me a hint?

Thanks in advance!

EDIT: After posting this question, I found a simple solution here.

So, my updated code becomes:

from lxml import objectify
root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for para in root.abstract.iterchildren():
    print(para)  # now prints the text of all paragraphs
for keyword in root.keywordSet.iterchildren():
    print(keyword)  # now prints all keywords in the set

maurobio asked Oct 14 '14 20:10

1 Answer

This is pretty simple using XPath:

from lxml import etree

tree = etree.parse('data.xml')

# Absolute paths must start at the root element, which is <book> here.
paragraphs = tree.xpath('/book/abstract/para/text()')
keywords = tree.xpath('/book/keywordSet/keyword/text()')

print(paragraphs)
print(keywords)

Output:

['First paragraph of the abstract', 'Second paragraph of the abstract']
['First keyword', 'Second keyword', 'Third keyword']

See the XPath Tutorial at W3Schools for details on the XPath syntax.

In particular, the expressions above use

  • The / selector, which selects from the root node at the start of an expression and selects immediate children between steps.
  • The text() node test, which selects the text nodes (the "textual content") of the respective elements.
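
To make the role of the leading / concrete, here is a minimal sketch that runs on its own. It assumes the question's sample XML is inlined as xml_string (rather than read from a file), and the variable names absolute, anywhere and missing are only illustrative:

```python
from lxml import etree

# The question's sample document, inlined so the sketch is self-contained.
xml_string = """
<book>
   <title>My First Book</title>
   <abstract>
      <para>First paragraph of the abstract</para>
      <para>Second paragraph of the abstract</para>
   </abstract>
   <keywordSet>
      <keyword>First keyword</keyword>
      <keyword>Second keyword</keyword>
      <keyword>Third keyword</keyword>
   </keywordSet>
</book>
"""

root = etree.fromstring(xml_string)

# Absolute path: every step from the document root is spelled out.
absolute = root.xpath('/book/abstract/para/text()')

# '//' matches at any depth, so the root element need not be named.
anywhere = root.xpath('//keyword/text()')

# An absolute path that skips the root element matches nothing.
missing = root.xpath('/abstract/para/text()')

print(absolute)
print(anywhere)
print(missing)
```

The // form is convenient for small documents like this one, but it matches the tag wherever it occurs, so the fully spelled-out path is the safer choice when the same tag name can appear in several places.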

Here's how it could be done using the Objectify API:

from lxml import objectify

root = objectify.fromstring(xml_string)

paras = [p.text for p in root.abstract.para]
keywords = [k.text for k in root.keywordSet.keyword]

print(paras)
print(keywords)
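
To see what each attribute access actually hands back, here is a small self-contained check. It is only a sketch: the question's XML is inlined as xml_string, and the variable names are mine, not part of the Objectify API.

```python
from lxml import objectify

# The question's sample document, inlined so the sketch is self-contained.
xml_string = """
<book>
   <title>My First Book</title>
   <abstract>
      <para>First paragraph of the abstract</para>
      <para>Second paragraph of the abstract</para>
   </abstract>
   <keywordSet>
      <keyword>First keyword</keyword>
      <keyword>Second keyword</keyword>
      <keyword>Third keyword</keyword>
   </keywordSet>
</book>
"""

root = objectify.fromstring(xml_string)
para = root.abstract.para

# Used as a single element, it is the first <para>.
first_text = para.text

# Used as a sequence, it covers all sibling <para> elements.
count = len(para)
second_text = para[1].text
all_texts = [p.text for p in para]

# Iterating the parent walks over <abstract> elements, not their children.
containers = [el.tag for el in root.abstract]

print(first_text)
print(count)
print(second_text)
print(all_texts)
print(containers)
```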

I initially assumed that root.abstract.para is just shorthand for root.abstract.para[0], and that you therefore need to call element.iterchildren() explicitly to access all child elements.

That's not true; we obviously both misunderstood the Objectify API. In order to iterate over the paras in abstract, you need to iterate over root.abstract.para, not root.abstract itself. Iterating over root.abstract only walks over the abstract elements (a single one here), which is why the question's loop ran once and printed only the first paragraph. It's weird, because you intuitively think of abstract as a collection or a container for its nodes, and expect that container to be represented by a Python iterable. But it's actually the .para selector that represents the sequence.
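
As for the side question about reading the whole tree into a dictionary: I'm not aware of a ready-made call for that in lxml, but a short recursive function gets you most of the way. The following is only a sketch under some assumptions: element_to_dict is a hypothetical helper I made up, it ignores attributes, and it assumes the document has no comments or mixed content (true for the sample XML).

```python
from lxml import etree

# The question's sample document, inlined so the sketch is self-contained.
xml_string = """
<book>
   <title>My First Book</title>
   <abstract>
      <para>First paragraph of the abstract</para>
      <para>Second paragraph of the abstract</para>
   </abstract>
   <keywordSet>
      <keyword>First keyword</keyword>
      <keyword>Second keyword</keyword>
      <keyword>Third keyword</keyword>
   </keywordSet>
</book>
"""


def element_to_dict(element):
    """Hypothetical helper: map each child tag to its text, a list of
    texts (for repeated tags), or a nested dict (for non-leaf children)."""
    result = {}
    for child in element:
        # Leaf elements contribute their text; elements with children recurse.
        value = child.text if len(child) == 0 else element_to_dict(child)
        if child.tag in result:
            # Repeated tag: turn the entry into a list and append to it.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


root = etree.fromstring(xml_string)
book = {root.tag: element_to_dict(root)}
print(book)
```

One design consequence to be aware of: a tag that occurs once maps to a plain string, while a repeated tag maps to a list, so code consuming the dictionary has to handle both shapes.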

Lukas Graf answered Oct 23 '22 01:10