Given the simple XML data below:
<book>
<title>My First Book</title>
<abstract>
<para>First paragraph of the abstract</para>
<para>Second paragraph of the abstract</para>
</abstract>
<keywordSet>
<keyword>First keyword</keyword>
<keyword>Second keyword</keyword>
<keyword>Third keyword</keyword>
</keywordSet>
</book>
How can I traverse the tree, using lxml, and get all paragraphs in the "abstract" element, as well as all keywords in the "keywordSet" element?
The code snippet below returns only the first line of text in each element:
from lxml import objectify
root = objectify.fromstring(xml_string) # xml_string contains the XML data above
print(root.title)  # prints the book title
for line in root.abstract:
    print(line.para)  # prints only the first paragraph
for word in root.keywordSet:
    print(word.keyword)  # prints only the first keyword in the set
I tried to follow this example, but the code above doesn't work as expected.
On a different tack, it would be even better to read the entire XML tree into a Python dictionary, with each element as a key and its text as the corresponding value(s). I found out that something like this might be possible using lxml objectify, but I couldn't figure out how to achieve it.
One really big problem I have been finding when attempting to write XML parsing code in Python is that most of the "examples" provided are just too simple and entirely fictitious to be of much help -- or else they are just the opposite, using too complicated automatically-generated XML data!
Could anybody give me a hint?
Thanks in advance!
EDIT: After posting this question, I found a simple solution here.
So, my updated code becomes:
from lxml import objectify
root = objectify.fromstring(xml_string) # xml_string contains the XML data above
print(root.title)  # prints the book title
for para in root.abstract.iterchildren():
    print(para)  # now prints the text of all paragraphs
for keyword in root.keywordSet.iterchildren():
    print(keyword)  # now prints all keywords in the set
This is pretty simple using XPath:
from lxml import etree
tree = etree.parse('data.xml')
# Absolute paths start at the document root, so they must name <book> first
paragraphs = tree.xpath('/book/abstract/para/text()')
keywords = tree.xpath('/book/keywordSet/keyword/text()')
print(paragraphs)
print(keywords)
Output:
['First paragraph of the abstract', 'Second paragraph of the abstract']
['First keyword', 'Second keyword', 'Third keyword']
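A self-contained sketch of the same XPath approach, assuming the XML from the question is held in a string (the name xml_string is taken from the question) rather than in data.xml. It also shows that an absolute path has to begin with the root element, book, and that a path leaving it out matches nothing:

```python
from lxml import etree

# The sample document from the question, held in a string for this sketch
xml_string = """<book>
<title>My First Book</title>
<abstract>
<para>First paragraph of the abstract</para>
<para>Second paragraph of the abstract</para>
</abstract>
<keywordSet>
<keyword>First keyword</keyword>
<keyword>Second keyword</keyword>
<keyword>Third keyword</keyword>
</keywordSet>
</book>"""

root = etree.fromstring(xml_string)  # root is the <book> element

# Absolute path: evaluated from the document root, so <book> comes first
paragraphs = root.xpath('/book/abstract/para/text()')

# Relative path: evaluated from the <book> element itself
keywords = root.xpath('keywordSet/keyword/text()')

# '//' matches <para> elements at any depth in the document
all_paras = root.xpath('//para/text()')

# An absolute path that skips the root element finds nothing
missing = root.xpath('/abstract/para/text()')

print(paragraphs)
print(keywords)
print(missing)
```

Which form to use is mostly a matter of taste: the absolute path is the most explicit, while '//' is convenient when the element's position in the tree may vary.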
See the XPath Tutorial at W3Schools for details on the XPath syntax.
In particular, the expressions above use the / selector to select the root node and then the immediate children, and the text() operator to select the text node (the "textual content") of the respective elements.

Here's how it could be done using the Objectify API:
from lxml import objectify
root = objectify.fromstring(xml_string)
paras = [p.text for p in root.abstract.para]
keywords = [k.text for k in root.keywordSet.keyword]
print(paras)
print(keywords)
It seems that root.abstract.para is actually shorthand for root.abstract.para[0]. So you need to explicitly use element.iterchildren() to access all child elements.
That's not true; we obviously both misunderstood the Objectify API:
In order to iterate over the paras in abstract, you need to iterate over root.abstract.para, not root.abstract itself. It's weird, because you intuitively think about abstract as a collection or a container for its nodes, and that container would be represented by a Python iterable. But it's actually the .para selector that represents the sequence.
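To make the sequence behaviour concrete, here is a minimal sketch using the XML from the question. It also takes a stab at the dictionary idea raised in the question. The helper children_to_dict is hypothetical (it is not part of lxml) and assumes a flat structure, where each child of the given element holds only text:

```python
from lxml import objectify

# The sample document from the question, held in a string for this sketch
xml_string = """<book>
<title>My First Book</title>
<abstract>
<para>First paragraph of the abstract</para>
<para>Second paragraph of the abstract</para>
</abstract>
<keywordSet>
<keyword>First keyword</keyword>
<keyword>Second keyword</keyword>
<keyword>Third keyword</keyword>
</keywordSet>
</book>"""

root = objectify.fromstring(xml_string)

# Iterating root.abstract walks over the <abstract> elements: just one here
abstract_count = len(list(root.abstract))

# Iterating root.abstract.para walks over all <para> siblings
paras = [p.text for p in root.abstract.para]

# Plain attribute access acts on the first item; indexing reaches the others
first_para = root.abstract.para.text
second_para = root.abstract.para[1].text


def children_to_dict(element):
    """Hypothetical helper: map each child tag to a list of child texts.

    Only looks one level deep, so nested children would need recursion.
    """
    result = {}
    for child in element.iterchildren():
        result.setdefault(child.tag, []).append(child.text)
    return result


book = {
    'title': root.title.text,
    'abstract': children_to_dict(root.abstract),
    'keywordSet': children_to_dict(root.keywordSet),
}

print(abstract_count)
print(paras)
print(book)
```

Storing the texts in lists keeps repeated tags such as para and keyword from overwriting each other, at the cost of a one-item list for tags that occur only once.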