
Using Python and lxml to strip only the tags that have certain attributes/values

Tags: python, lxml

I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping only those tags that carry particular attributes/values, while leaving their contents in place.

For instance: I'd like to strip all span or div tags (or other elements) from a tree (xhtml) that have a class='myclass' attribute/value (preserving the element's contents like strip_tags would do). Meanwhile, those same elements that don't have class='myclass' should remain untouched.

Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs (or any other elements, for that matter) that have absolutely no attributes, while leaving the same elements untouched when they have any attributes at all.

I feel I'm missing something obvious, but I've been searching without any luck for quite some time.

Bush League asked Feb 10 '14 19:02





2 Answers

HTML

lxml's HTML elements have a method drop_tag() which you can call on any element in a tree parsed by lxml.html.

It acts similarly to strip_tags in that it removes the element but retains its text, and it is called on the element itself. That means you can select the elements you want to unwrap with an XPath expression, then loop over them and drop each one:

doc.html

<html>
    <body>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

strip.py

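from lxml import etree
from lxml import html

# Parse the HTML file into an ElementTree
doc = html.parse('doc.html')
# xpath() returns a list, so elements can be dropped while looping over it
spans_with_attrs = doc.xpath("//span[@attr='foo']")

for span in spans_with_attrs:
    # Remove the tag itself but keep its text and children
    span.drop_tag()

print(etree.tostring(doc, encoding='unicode'))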

Output:

<html>
    <body>
        <div>This is some Text.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get removed as well.</div>
        <div>Nested elements will <b>be</b> left alone.</div>
        <div>Unless they also match.</div>
    </body>
</html>

In this case, the XPath expression //span[@attr='foo'] selects all the span elements with an attribute attr of value foo. See this XPath tutorial for more details on how to construct XPath expressions.
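The two cases from the question (a given class value, and "naked" elements without any attributes) can be expressed the same way. The following is only a sketch under assumptions: the sample markup, the helper name drop_matching and the constant names are made up for illustration, and the class-token expression is a common XPath 1.0 idiom rather than anything specific to lxml.

```python
from lxml import html

# Hypothetical sample markup, made up for this illustration.
SAMPLE = (
    '<html><body>'
    '<div class="myclass">A <span>naked</span> span</div>'
    '<div id="keep">B <span class="myclass big">token</span> C</div>'
    '<p class="other">D</p>'
    '</body></html>'
)

# Elements whose class attribute is exactly 'myclass'.
EXACT_CLASS = "//*[@class='myclass']"
# Elements that list 'myclass' among several space-separated classes.
CLASS_TOKEN = (
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]"
)
# "Naked" spans and divs: elements without any attribute at all.
NAKED = "//*[self::span or self::div][not(@*)]"


def drop_matching(root, xpath_expr):
    """Call drop_tag() on every element selected by xpath_expr (hypothetical helper)."""
    # xpath() returns a list, so the tree can be modified while looping.
    for el in root.xpath(xpath_expr):
        el.drop_tag()
    return root


root = html.document_fromstring(SAMPLE)
drop_matching(root, CLASS_TOKEN)  # EXACT_CLASS would skip class="myclass big"
drop_matching(root, NAKED)
print(html.tostring(root, encoding="unicode"))
```

The div with id="keep" and the p with class="other" survive because they have attributes and do not carry the myclass token; only their matching descendants are unwrapped.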

XML / XHTML

Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document (one parsed by lxml.html).

So for XML it's a bit more complicated:

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

from lxml import etree


def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')

        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')

        parent = node.getparent()
        prev = node.getprevious()
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)


doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)

print(etree.tostring(doc, encoding='unicode'))

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first span should <span>be</span> removed.</node>
</document>

As you can see, this replaces each matched node and all of its children with the node's recursive text content, so nested tags such as <b> are lost as well. I really hope that's what you want, otherwise things get even more complicated ;-)
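If the child elements should be kept, one possible approach is to move them up into the parent before removing the node. The helper below is a hypothetical sketch (the name unwrap is made up) modelled on what drop_tag() does for HTML elements; it is not part of the lxml API.

```python
from lxml import etree


def unwrap(node):
    """Remove node from the tree but keep its text, children and tail.

    Hypothetical helper, not part of lxml.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    prev = node.getprevious()

    def append_text(text):
        # Text goes to the previous sibling's tail if there is one,
        # otherwise to the parent's text.
        if not text:
            return
        if prev is None:
            parent.text = (parent.text or "") + text
        else:
            prev.tail = (prev.tail or "") + text

    append_text(node.text)
    if len(node):
        # Once the children move up, node's tail follows the last child.
        if node.tail:
            last = node[-1]
            last.tail = (last.tail or "") + node.tail
    else:
        append_text(node.tail)

    # Replace node with its own children, in place.
    index = parent.index(node)
    parent[index:index + 1] = list(node)


root = etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
)
for span in root.xpath("//span[@attr='foo']"):
    unwrap(span)
print(etree.tostring(root, encoding="unicode"))
```

Here the nested b element stays in the tree, unlike with strip_nodes, which flattens everything inside the matched node to text.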

NOTE: The last edit changed the code in question.

Lukas Graf answered Oct 15 '22 12:10



I just had the same problem, and after some consideration came up with this rather hacky idea, borrowed from regex-ing markup in Perl one-liners: first catch all unwanted elements with all the power that element.iterfind brings, rename those elements to something unlikely, and then strip all elements with that name.

Yes, this isn't absolutely clean and robust, as you might always have a document that actually uses the "unlikely" tag name you've chosen, but the resulting code is rather clean and easily maintainable. If you really need to be sure that the "unlikely" name you've picked doesn't already exist in the document, you can check for its existence first and do the renaming only if you can't find any pre-existing tags of that name.

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

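from lxml import etree

xml = etree.parse("doc.xml")
deltag = "xxyyzzdelme"
# Rename every matching element to the placeholder tag
for el in xml.iterfind(".//span[@attr='foo']"):
    el.tag = deltag
# strip_tags() removes the renamed tags but keeps their text and children
etree.strip_tags(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))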

Output

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
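The existence check mentioned above could be sketched roughly as follows. The function name strip_matching and the "strip-" plus random hex naming scheme are assumptions made for this illustration; the idea is simply to regenerate the placeholder until no element in the tree already uses it.

```python
import uuid

from lxml import etree


def strip_matching(root, path):
    """Unwrap all elements matched by an ElementPath expression.

    Hypothetical helper: renames matches to a placeholder tag that is
    checked to be unused, then lets etree.strip_tags() do the unwrapping.
    """
    # Hypothetical naming scheme: pick random names until one is unused.
    placeholder = "strip-" + uuid.uuid4().hex
    while next(root.iter(placeholder), None) is not None:
        placeholder = "strip-" + uuid.uuid4().hex

    # Collect the matches first, then rename them.
    for el in list(root.iterfind(path)):
        el.tag = placeholder

    # Removes the placeholder tags, merging text and children into the parent.
    etree.strip_tags(root, placeholder)
    return root


root = etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
)
strip_matching(root, ".//span[@attr='foo']")
print(etree.tostring(root, encoding="unicode"))
```

Note that strip_tags() never removes the element it is called on, so a matching root element would be left in place with the placeholder name; the relative path ".//..." used here does not select the root in the first place.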
Thor answered Oct 15 '22 12:10
