I have XML data that looks like:
<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>
I would like to be able to extract:
(3) is the most important requirement right now; etree provides (1) fine.
I cannot see any way to do (3) directly, but hoped that iterating through the elements in the document tree would return many small strings that could be re-assembled, thus providing (2) and (3). However, requesting the .text of the root node only returns text between the root node and the first element, e.g. "The capital of ".
Doing (1) with SAX could involve implementing a lot that's already been written many times over, in e.g. minidom and etree. Using lxml isn't an option for the package that this code is to go into. Can anybody help?
The iterparse() function is available in xml.etree:
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

# `file` is a filename or a file object opened in binary mode
for event, elem in etree.iterparse(file, events=('start', 'end')):
    if event == 'start':
        print(elem.tag)  # use only tag name and attributes here
    elif event == 'end':
        # elem children elements, elem.text, elem.tail are available
        if elem.text is not None and elem.tail is not None:
            print(repr(elem.tail))
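One caveat with the loop above: the ElementTree documentation only guarantees that an element's own .text and children are complete at its 'end' event, and a child's .tail is only certain to be complete once its parent has ended. Here is a minimal sketch that stays on the safe side by reading each child's .tail from the parent's 'end' event. The helper name direct_text_pieces and the in-memory sample are my own, not from the question's code.

```python
import io
import xml.etree.ElementTree as ET

# Sample document modelled on the question (held in memory for the demo).
XML = b"""<xml>
The capital of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

def direct_text_pieces(source):
    """Yield (tag, pieces) for every element once its end tag has been parsed.

    pieces holds the element's own .text followed by the .tail of each
    direct child, i.e. the text that belongs to the element itself.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        pieces = [elem.text or ""]
        for child in elem:
            # The parent has ended, so every child's tail is complete here.
            pieces.append(child.tail or "")
        yield elem.tag, pieces

results = list(direct_text_pieces(io.BytesIO(XML)))
for tag, pieces in results:
    print(tag, pieces)
```

The last entry is the root element, whose pieces are the strings surrounding the two place elements.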
Another option is to override the start(), data(), and end() methods of etree.TreeBuilder():
from xml.etree.ElementTree import XMLParser, TreeBuilder

class MyTreeBuilder(TreeBuilder):

    def start(self, tag, attrs):
        print("<%s>" % tag)
        return TreeBuilder.start(self, tag, attrs)

    def data(self, data):
        print(repr(data))
        TreeBuilder.data(self, data)

    def end(self, tag):
        return TreeBuilder.end(self, tag)

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

# ElementTree.fromstring()
parser = XMLParser(target=MyTreeBuilder())
parser.feed(text)
root = parser.close()  # return an ordinary Element
Output:
<xml>
'\nThe captial of '
<place>
'South Africa'
' is '
<place>
'Pretoria'
'.\n'
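If what you are after is the plain text plus the position of each element inside it (my guess at what the re-assembly is for, so treat it as an assumption), the same callbacks can keep a running character count. The sketch below uses a plain target object instead of subclassing TreeBuilder; the class name SpanCollector and its attributes are made up for illustration. XMLParser only needs the target to provide start(), data(), end() and close(), and parser.close() returns whatever the target's close() returns.

```python
from xml.etree.ElementTree import XMLParser

class SpanCollector:
    """Hypothetical parser target: rebuilds the text and records element spans."""

    def __init__(self):
        self.chunks = []   # text chunks in document order
        self.length = 0    # number of characters seen so far
        self.open = []     # stack of (tag, attrs, start_offset)
        self.spans = []    # finished (tag, attrs, start_offset, end_offset)

    def start(self, tag, attrs):
        self.open.append((tag, dict(attrs), self.length))

    def data(self, data):
        # data() may be called several times for one run of text.
        self.chunks.append(data)
        self.length += len(data)

    def end(self, tag):
        name, attrs, begin = self.open.pop()
        self.spans.append((name, attrs, begin, self.length))

    def close(self):
        return "".join(self.chunks), self.spans

parser = XMLParser(target=SpanCollector())
parser.feed('<xml>The capital of <place pid="1">South Africa</place>'
            ' is <place>Pretoria</place>.</xml>')
plain_text, spans = parser.close()

for name, attrs, begin, end in spans:
    print(name, attrs, repr(plain_text[begin:end]))
```

No tree is built here, so this only makes sense if you do not also need the Element objects.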
You need to look at the .tail property as well as .text: .text gives you the text directly after a start tag, .tail gives you the text directly after the end tag. This will provide you with your "many small strings".
Tip: you can use etree.iterwalk(elem) (it does the same thing as etree.iterparse(), but over an existing tree; note that it is part of lxml.etree, not the standard library's xml.etree) to iterate over the start and end tags. To give the idea:
# iterwalk() is provided by lxml.etree; xml_elem is an already-parsed element
for event, elem in etree.iterwalk(xml_elem, events=('start', 'end')):
    if event == 'start':
        # it's a start tag
        print('starting element', elem.tag)
        print(elem.text)
    elif event == 'end':
        # it's an end tag
        print('ending element', elem.tag)
        if elem is not xml_elem:
            # don't want the text trailing xml_elem
            print(elem.tail)
I guess you can complete the rest for yourself?
Warning: .text and .tail can be None, so if you want to concatenate you will have to guard against that (use (elem.text or '') for example).
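Since lxml is not an option for you, the same walk can be written with the standard library alone. This is a minimal sketch, assuming the tree is already parsed; the generator name text_pieces is my own. It yields .text and .tail in document order, guards against None as described above, and skips the tail of the element you start from.

```python
import xml.etree.ElementTree as ET

def text_pieces(elem):
    """Yield the text fragments inside elem in document order."""
    yield elem.text or ""           # text right after elem's start tag
    for child in elem:
        yield from text_pieces(child)
        yield child.tail or ""      # text right after the child's end tag

root = ET.fromstring(
    '<xml>The capital of <place pid="1">South Africa</place>'
    ' is <place>Pretoria</place>.</xml>'
)
pieces = list(text_pieces(root))
print(pieces)
print("".join(pieces))
```

Empty strings are yielded for missing text so that the pieces line up one-to-one with the tree positions; filter them out if you only want the non-empty fragments.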
If you are familiar with SAX (or have existing SAX code that does what you need), lxml lets you produce SAX events from an element or tree:
lxml.sax.saxify(elem, handler)
Some other things to look for when extracting all the text from an element: the .itertext() method, the xpath expression .//text() (lxml lets you return "smart strings" from xpath expressions: they allow you to check which element they belong to etc...).
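Of these, .itertext() is also available on the standard library's Element, so it works without lxml. A short sketch on a sample modelled on the question's document (the variable names are mine):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<xml>The capital of <place pid="1">South Africa</place>'
    ' is <place>Pretoria</place>.</xml>'
)

# All text below the root, in document order, with the markup stripped.
full_text = "".join(root.itertext())

# The text of each <place> element on its own.
place_names = ["".join(place.itertext()) for place in root.iter("place")]

print(full_text)
print(place_names)
```

Note that itertext() does not include the tail of the element it is called on, which is why each place comes back without the text that follows it.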