It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
... you could easily remove every <r> element with:
from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    # remove() also discards r.tail, i.e. the text that follows <r/>
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True, encoding='unicode'))
However, how would you go about replacing each element with text, to get the output:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
It seems to me that because the ElementTree API handles text through the .text and .tail attributes of each element rather than as nodes in the tree, you have to deal with a lot of different cases, depending on whether the element has sibling elements, whether the existing element had a .tail attribute, and so on. Have I missed some easy way of doing this?
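To make the .text/.tail model concrete, here is a minimal sketch of where the text around an <r/> element ends up. It uses the standard library's xml.etree.ElementTree so that it runs without lxml; lxml.etree exposes the same two attributes. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

m = ET.fromstring('<m>Text before <r/> and after</m>')
r = m.find('r')

# Text before the first child is stored on the parent as .text
print(repr(m.text))   # 'Text before '

# Text after <r/>, up to the next sibling or the parent's end tag,
# is stored on <r/> itself as .tail
print(repr(r.tail))   # ' and after'

# <r/> has no content of its own, so its .text is None
print(repr(r.text))   # None
```

So replacing <r/> with text means appending to either the parent's .text or the previous sibling's .tail, depending on whether <r/> is the first child.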
Using strip_elements has the disadvantage that you cannot make it keep some of the <r> elements while replacing others. It also requires the existence of an ElementTree instance (which you may not have). And last, you cannot use it to replace XML comments or processing instructions.
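The following should do the job:
for r in f.xpath('//r'):
    # r.tail is None when no text follows <r/>, so fall back to ''
    text = 'DELETED' + (r.tail or '')
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            # <r/> has a preceding sibling: the text goes into that sibling's tail
            previous.tail = (previous.tail or '') + text
        else:
            # <r/> is the first child: the text goes into the parent's text
            parent.text = (parent.text or '') + text
        parent.remove(r)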
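If you need this in more than one place, the loop can be wrapped in a small helper. This is only a sketch: the name replace_with_text and the keep attribute used to skip some elements are made up for the example, and the helper guards against a missing tail with (element.tail or '').

```python
from lxml import etree


def replace_with_text(element, text):
    """Replace an lxml element with plain text, preserving its tail."""
    parent = element.getparent()
    if parent is None:
        raise ValueError('cannot replace the root element')
    # The element's tail must survive, so append it to the new text
    text += element.tail or ''
    previous = element.getprevious()
    if previous is not None:
        # There is a preceding sibling: extend that sibling's tail
        previous.tail = (previous.tail or '') + text
    else:
        # The element is the first child: extend the parent's text
        parent.text = (parent.text or '') + text
    # In lxml, remove() takes the element's tail with it
    parent.remove(element)


# Replace every <r/>
root = etree.fromstring(
    '<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>')
for r in root.xpath('//r'):
    replace_with_text(r, 'DELETED')
print(etree.tostring(root, encoding='unicode'))
# <m><b/> Text after a sibling DELETED Text before a sibling<b/></m>

# Replace only the <r/> elements without a (hypothetical) keep attribute
selective = etree.fromstring('<m>A <r keep="yes"/> B <r/> C</m>')
for r in selective.xpath('//r[not(@keep)]'):
    replace_with_text(r, 'DELETED')
print(etree.tostring(selective, encoding='unicode'))
# <m>A <r keep="yes"/> B DELETED C</m>
```

Because the XPath expression decides which elements are touched, keeping some <r> elements while replacing others only requires changing the selector.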
I think that unutbu's XSLT solution is probably the correct way to achieve your goal.
However, here's a somewhat hacky way to achieve it, by modifying the tails of <r/> tags and then using etree.strip_elements.
from lxml import etree
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
f = etree.fromstring(data)
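for r in f.xpath('//r'):
    # Prepend the replacement to the tail, which survives the strip below
    r.tail = 'DELETED' + (r.tail or '')
# with_tail=False removes the <r/> elements but leaves their tail text in place
etree.strip_elements(f, 'r', with_tail=False)
print(etree.tostring(f, pretty_print=True, encoding='unicode'))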
Gives you:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>