I'm familiar with etree's strip_tags
and strip_elements
methods, but I'm looking for a straightforward way of stripping tags (and leaving their contents) that only contain particular attributes/values.
For instance: I'd like to strip all span
or div
tags (or other elements) from a tree (xhtm
l) that have a class='myclass'
attribute/value (preserving the element's contents like strip_tags
would do). Meanwhile, those same elements that don't have class='myclass'
should remain untouched.
Conversely: I'd like a way to strip all "naked" spans
or divs
from a tree. Meaning only those spans
/divs
(or any other elements for that matter) that have absolutely no attributes. Leaving those same elements that have attributes (any) untouched.
I feel I'm missing something obvious, but I've been searching without any luck for quite some time.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.
lxml
s HTML elements have a method drop_tag()
which you can call on any element in a tree parsed by lxml.html
.
It acts similar to strip_tags
in that it removes the element, but retains the text, and it can be called on the element - which means you can easily select the elements you're not interested in with an XPath expression, and then loop over them and remove them:
doc.html
<html>
<body>
<div>This is some <span attr="foo">Text</span>.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get <span attr="foo">removed</span> as well.</div>
<div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
<div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
</body>
</html>
strip.py
from lxml import etree
from lxml import html
doc = html.parse(open('doc.html'))
spans_with_attrs = doc.xpath("//span[@attr='foo']")
for span in spans_with_attrs:
span.drop_tag()
print etree.tostring(doc)
Output:
<html>
<body>
<div>This is some Text.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get removed as well.</div>
<div>Nested elements will <b>be</b> left alone.</div>
<div>Unless they also match.</div>
</body>
</html>
In this case, the XPath expression //span[@attr='foo']
selects all the span
elements with an attribute attr
of value foo
. See this XPath tutorial for more details on how to construct XPath expressions.
Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag()
method is really only available for elements in a HTML document.
So for XML it's a bit more complicated:
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree
def strip_nodes(nodes):
for node in nodes:
text_content = node.xpath('string()')
# Include tail in full_text because it will be removed with the node
full_text = text_content + (node.tail or '')
parent = node.getparent()
prev = node.getprevious()
if prev:
# There is a previous node, append text to its tail
prev.tail += full_text
else:
# It's the first node in <parent/>, append to parent's text
parent.text = (parent.text or '') + full_text
parent.remove(node)
doc = etree.parse(open('doc.xml'))
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)
print etree.tostring(doc)
Output:
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first span should <span>be</span> removed.</node>
</document>
As you can see, this will replace node and all its children with the recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)
NOTE Last edit have changed the code in question.
I just had the same problem, and after some cosideration had this rather hacky idea, which is borrowed from regex-ing Markup in Perl onliners: How about first catching all unwanted Elements with all the power that element.iterfind
brings, renaming those elements to something unlikely, and then strip all those elements?
Yes,this isn't absolutely clean and robust, as you always might have a document that actually uses the "unlikely" tag name you've chosen, but the resulting code IS rather clean and easily maintainable. If you really need to be sure that whatever "unlikely" name you've picked doesn't exist already in the document, you can always check for it's existing first, and do the renaming only if you can't find any pre-existing tags of that name.
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree
xml = etree.parse("doc.xml")
deltag ="xxyyzzdelme"
for el in xml.iterfind("//span[@attr='foo']"):
el.tag = deltag
etree.strip_tag(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))
Output
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With