How can one replace an element with text in lxml?

Tags: python, xml, lxml, elementtree

It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

... you could easily remove every <r> element with:

from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())

However, how would you go about replacing each element with text, to get the output:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/>Text after a sibling DELETED Text before a sibling<b/></m>
</everything>

It seems to me that because the ElementTree API deals with text via the .text and .tail attributes of each element rather than nodes in the tree, this means you have to deal with a lot of different cases depending on whether the element has sibling elements or not, whether the existing element had a .tail attribute, and so on. Have I missed some easy way of doing this?
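To make the .text/.tail model concrete, here is a minimal sketch on a single <m> element taken from the input above. It only uses standard lxml calls; the variable names are chosen for illustration. It also shows why the plain remove() loop loses text: in lxml the tail belongs to the element, so it is removed along with it.

```python
from lxml import etree

m = etree.fromstring('<m>Text before <r/> and after</m>')
r = m[0]

# Text before the first child is stored on the parent's .text;
# text following an element is stored on that element's .tail.
text_before = m.text
text_after = r.tail

# Removing the element also removes its tail from the tree.
m.remove(r)
remaining = etree.tostring(m, encoding='unicode')
print(repr(text_before), repr(text_after), remaining)
```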


asked Mar 24 '11 11:03

Mark Longair

2 Answers

Using strip_elements has the disadvantage that you cannot keep some of the <r> elements while replacing others. It also requires the existence of an ElementTree instance (which may not be the case). Finally, you cannot use it to replace XML comments or processing instructions. The following should do the job:

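for r in f.xpath('//r'):
    # r.tail is None when no text follows the element
    text = 'DELETED' + (r.tail or '')
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            # append to the tail of the preceding sibling
            previous.tail = (previous.tail or '') + text
        else:
            # no preceding sibling: append to the parent's text
            parent.text = (parent.text or '') + text
        parent.remove(r)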
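As a self-contained sketch, the loop above can be wrapped in a small helper and run against the sample input from the question. The function name replace_with_text is hypothetical (it is not part of lxml), and raising an error for the root element is a design choice made here, not something from the original answer.

```python
from lxml import etree

def replace_with_text(element, text):
    """Replace `element` with `text`, keeping the element's tail.

    Hypothetical helper for illustration; not an lxml API.
    """
    parent = element.getparent()
    if parent is None:
        raise ValueError('cannot replace the root element with text')
    new_text = text + (element.tail or '')
    previous = element.getprevious()
    if previous is not None:
        # Append to the tail of the preceding sibling.
        previous.tail = (previous.tail or '') + new_text
    else:
        # No preceding sibling: append to the parent's text.
        parent.text = (parent.text or '') + new_text
    parent.remove(element)

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

root = etree.fromstring(data)
for r in root.xpath('//r'):
    replace_with_text(r, 'DELETED')
result = etree.tostring(root, encoding='unicode')
print(result)

# The same helper applied to an XML comment.
doc = etree.fromstring('<p>a<!-- note -->b</p>')
for c in doc.xpath('//comment()'):
    replace_with_text(c, '[comment]')
comment_result = etree.tostring(doc, encoding='unicode')
print(comment_result)
```

Because the helper works on one node at a time, you can filter the XPath result (or add any condition inside the loop) to replace only some of the elements and leave the rest untouched.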


answered Sep 18 '22 02:09

bernulf

I think that unutbu's XSLT solution is probably the correct way to achieve your goal.

However, here's a somewhat hacky way to achieve it, by modifying the tails of <r/> tags and then using etree.strip_elements.

from lxml import etree

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

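f = etree.fromstring(data)
for r in f.xpath('//r'):
    # prepend the replacement text to the tail (which may be None)
    r.tail = 'DELETED' + (r.tail or '')

# with_tail=False removes the elements but keeps their tail text
etree.strip_elements(f, 'r', with_tail=False)

print(etree.tostring(f, pretty_print=True).decode())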

Gives you:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
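The with_tail=False argument is what makes this trick work. A minimal sketch of the difference, using one <m> element from the input (the helper name strip_r is made up for this illustration):

```python
from lxml import etree

def strip_r(with_tail):
    # Hypothetical helper: strip <r> from a fresh copy of one sample element.
    m = etree.fromstring('<m>Text before <r/> and after</m>')
    etree.strip_elements(m, 'r', with_tail=with_tail)
    return etree.tostring(m, encoding='unicode')

dropped = strip_r(with_tail=True)   # tail text is removed with the element
kept = strip_r(with_tail=False)     # tail text stays in the parent
print(dropped)
print(kept)
```

With the default with_tail=True, the tail (and therefore the 'DELETED' text prepended to it) would be discarded together with the element.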

answered Sep 18 '22 02:09

MattH
