Using Python and lxml to strip only the tags that have certain attributes/values

Tags: python, lxml

I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping tags (and leaving their contents) that only contain particular attributes/values.

For instance: I'd like to strip all span or div tags (or other elements) from a tree (xhtml) that have a class='myclass' attribute/value (preserving the element's contents like strip_tags would do). Meanwhile, those same elements that don't have class='myclass' should remain untouched.

Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs (or any other elements, for that matter) that have absolutely no attributes, while leaving elements that have any attributes untouched.

I feel I'm missing something obvious, but I've been searching without any luck for quite some time.

asked Feb 10 '14 19:02

Bush League

2 Answers

HTML

lxml's HTML elements have a drop_tag() method, which you can call on any element in a tree parsed by lxml.html.

It acts like strip_tags in that it removes the element but keeps its text and children. Because it is called on the element itself, you can select the elements you want to strip with an XPath expression, then loop over them and drop each one:

doc.html

<html>
    <body>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

strip.py

from lxml import etree
from lxml import html

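# Parse the file straight from its path
doc = html.parse('doc.html')
# xpath() returns a plain list, so the tree can be modified while looping
spans_with_attrs = doc.xpath("//span[@attr='foo']")

for span in spans_with_attrs:
    # Remove the tag itself but keep its text and children
    span.drop_tag()

print(etree.tostring(doc, encoding="unicode"))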

Output:

<html>
    <body>
        <div>This is some Text.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get removed as well.</div>
        <div>Nested elements will <b>be</b> left alone.</div>
        <div>Unless they also match.</div>
    </body>
</html>

In this case, the XPath expression //span[@attr='foo'] selects all the span elements with an attribute attr of value foo. See this XPath tutorial for more details on how to construct XPath expressions.
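For the two cases in the question, only the XPath expression needs to change. The following is a minimal sketch, assuming the markup is parsed with lxml.html; the sample markup and the helper name drop_matching are made up for illustration and are not part of lxml.

```python
from lxml import html

# Hypothetical sample markup, not taken from the question.
SAMPLE = (
    '<div id="root">'
    '<span class="myclass">one</span> '
    '<span class="other">two</span> '
    '<span>three</span> '
    '<div class="myclass">four <b>bold</b></div>'
    '</div>'
)


def drop_matching(fragment, xpath_expr):
    """Parse an HTML fragment, drop the tags matched by xpath_expr, return markup."""
    root = html.fromstring(fragment)
    # xpath() returns a list, so dropping tags while looping is safe
    for el in root.xpath(xpath_expr):
        el.drop_tag()
    return html.tostring(root, encoding="unicode")


# Any element whose class attribute is exactly "myclass"
by_class = drop_matching(SAMPLE, "//*[@class='myclass']")

# "Naked" spans and divs: elements that carry no attributes at all
naked = drop_matching(SAMPLE, "//span[not(@*)] | //div[not(@*)]")

print(by_class)
print(naked)
```

Note that @class='myclass' only matches when the attribute value is exactly myclass. For elements that carry several classes, the usual XPath idiom is contains(concat(' ', normalize-space(@class), ' '), ' myclass ').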

XML / XHTML

Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document.

So for XML it's a bit more complicated:

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

from lxml import etree


def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')

        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')

        parent = node.getparent()
        prev = node.getprevious()
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)


doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)

print(etree.tostring(doc, encoding="unicode"))

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first span should <span>be</span> removed.</node>
</document>

As you can see, this will replace the node and all its children with the node's recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)

NOTE: The last edit changed the code in question.
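If the children should be kept rather than flattened to text, the behaviour of drop_tag() can be approximated for plain etree elements. This is a rough sketch, not part of lxml's API: unwrap is a hypothetical helper name, and the inline XML string just mirrors doc.xml above.

```python
from lxml import etree


def unwrap(el):
    """Remove el from the tree but keep its text, children and tail in place."""
    parent = el.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    prev = el.getprevious()

    # The element's leading text moves to whatever precedes the element
    if el.text:
        if prev is None:
            parent.text = (parent.text or "") + el.text
        else:
            prev.tail = (prev.tail or "") + el.text

    # The element's tail moves behind its last child, or behind the leading text
    if el.tail:
        if len(el):
            el[-1].tail = (el[-1].tail or "") + el.tail
        elif prev is None:
            parent.text = (parent.text or "") + el.tail
        else:
            prev.tail = (prev.tail or "") + el.tail

    # Replace the element with its children (slice assignment moves them)
    index = parent.index(el)
    parent[index:index + 1] = el[:]


XML = (
    "<document>"
    "<node>This is <span>some</span> text.</node>"
    '<node>Only this <span attr="foo">first <b>span</b></span>'
    " should <span>be</span> removed.</node>"
    "</document>"
)

root = etree.fromstring(XML)
for el in root.xpath("//span[@attr='foo']"):
    unwrap(el)

result = etree.tostring(root, encoding="unicode")
print(result)
```

For the "naked" case from the question, the same loop should work with //span[not(@*)]. Keep in mind that a real XHTML document usually declares the namespace http://www.w3.org/1999/xhtml, in which case the XPath needs a prefix, for example root.xpath("//x:span[@class='myclass']", namespaces={"x": "http://www.w3.org/1999/xhtml"}).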

answered Oct 15 '22 12:10

Lukas Graf

I just had the same problem, and after some consideration had this rather hacky idea, which is borrowed from regex-ing markup in Perl one-liners: how about first catching all unwanted elements with all the power that element.iterfind brings, renaming those elements to something unlikely, and then stripping all elements with that name?

Yes, this isn't absolutely clean and robust, as you might always have a document that actually uses the "unlikely" tag name you've chosen, but the resulting code IS rather clean and easily maintainable. If you really need to be sure that whatever "unlikely" name you've picked doesn't already exist in the document, you can always check for its existence first, and do the renaming only if you can't find any pre-existing tags of that name.

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

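from lxml import etree
xml = etree.parse("doc.xml")
deltag = "xxyyzzdelme"
# Rename every matching element to the placeholder tag;
# list() takes a snapshot so renaming cannot disturb the iteration
for el in list(xml.iterfind(".//span[@attr='foo']")):
    el.tag = deltag
# strip_tags removes the placeholder tags but keeps their text and children
etree.strip_tags(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))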

Output

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
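The existence check described above could be folded into a small helper along these lines. This is only a sketch: strip_matching and the delme_ prefix are made-up names, and the placeholder is generated with uuid rather than hard-coded, which is a variation on the idea and not something lxml provides.

```python
import uuid

from lxml import etree


def strip_matching(tree, xpath_expr):
    """Strip the tags matched by xpath_expr (keeping content) via a temporary tag."""
    # Pick a placeholder name that no element in the tree currently uses
    placeholder = "delme_" + uuid.uuid4().hex
    while next(tree.iter(placeholder), None) is not None:
        placeholder = "delme_" + uuid.uuid4().hex

    # xpath_expr is assumed to select elements only (not text or attributes)
    for el in tree.xpath(xpath_expr):
        el.tag = placeholder

    # Removes the placeholder tags, merging text and children into the parent
    etree.strip_tags(tree, placeholder)
    return tree


XML = (
    "<document>"
    "<node>This is <span>some</span> text.</node>"
    '<node>Only this <span attr="foo">first <b>span</b></span>'
    " should <span>be</span> removed.</node>"
    "</document>"
)

root = etree.fromstring(XML)

# Case 1: elements with a particular attribute value
strip_matching(root, "//span[@attr='foo']")
after_attr = etree.tostring(root, encoding="unicode")

# Case 2: the remaining "naked" spans, i.e. spans without any attributes
strip_matching(root, "//span[not(@*)]")
after_naked = etree.tostring(root, encoding="unicode")

print(after_attr)
print(after_naked)
```

One caveat: strip_tags does not strip the element it is called on, so an expression that matches the root element itself would leave the root renamed to the placeholder.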

answered Oct 15 '22 12:10

Thor
