
How to delete a tag from a node in lxml without removing its tail?

Example:

html = <a><b>Text</b>Text2</a>

BeautifulSoup code:

[x.extract() for x in html.findAll('b')]

The result is:

html = <a>Text2</a>

lxml code:

[bad.getparent().remove(bad) for bad in html.xpath(".//b")]

The result is:

html = <a></a>

because lxml treats "Text2" as the tail of <b></b>, and removing an element removes its tail with it.
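To make the tail behaviour concrete, here is a minimal sketch using plain lxml.etree on the same markup. The variable names are only illustrative.

```python
from lxml import etree

root = etree.fromstring('<a><b>Text</b>Text2</a>')
b = root.find('b')

# "Text" is the element's own text; "Text2" follows </b> and is stored as b.tail
b_text = b.text
b_tail = b.tail
root_text = root.text  # None: nothing precedes <b> inside <a>

# Removing <b> through its parent takes the tail along with the element
root.remove(b)
after_remove = etree.tostring(root, encoding='unicode')
```

After the removal, "Text2" no longer appears in the serialized tree, which is the behaviour described above.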

If we only need the joined text content, without the text inside the matched tags, we can blank the text of each matched element:

for bad in raw.xpath(xpath_search):
    bad.text = ''

But how can the tags themselves be removed, without changing the surrounding text and without losing the tail?

Asked by Anton Oleynick, Mar 21 '17 16:03


2 Answers

While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.

If you want to remove a specific element, then the method you are looking for is drop_tree, which is available on elements parsed with lxml.html.

From the docs:

Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.

If you want to remove all instances of a specific tag, you can use the lxml.etree.strip_elements or lxml.html.etree.strip_elements with with_tail=False.

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

So, for the example in the original post:

>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...    bad.drop_tree()
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'

or

>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
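As a further illustration (not part of the original example), the sketch below contrasts the default with_tail=True against with_tail=False on a slightly longer snippet. The sample markup and variable names are assumptions made for the demonstration.

```python
from lxml import etree

SOURCE = '<p>Hello <b>bold</b> world <i>x</i> end</p>'

# Default behaviour: the <b> element and its tail " world " are both removed
dropped = etree.fromstring(SOURCE)
etree.strip_elements(dropped, 'b')
dropped_out = etree.tostring(dropped, encoding='unicode')

# with_tail=False: only the <b> element goes, its tail text stays in the tree
kept = etree.fromstring(SOURCE)
etree.strip_elements(kept, 'b', with_tail=False)
kept_out = etree.tostring(kept, encoding='unicode')
```

In both cases the <i> element and its own tail are untouched, because only the tag names passed to strip_elements are affected.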
Answered by Joshmaker, Nov 02 '22 12:11


Edit:

Please see @Joshmaker's answer, https://stackoverflow.com/a/47946748/8055036, which is clearly the better one.

I did the following to save the tail text by moving it to the previous sibling or, if there is none, to the parent.

def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail: # preserve the tail
        previous = node.getprevious()
        if previous is not None: # if there is a previous sibling it will get the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else: # no previous sibling: the parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
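The two methods above belong to a class, so they cannot be run as posted. A minimal standalone sketch of the same idea might look like this; the function name is reused from above, while the sample markup and the ValueError for a root element are assumptions added for the demonstration.

```python
from lxml import etree

def remove_keeping_tail(element):
    """Remove element from its parent, but keep its tail text in the tree."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # Append the tail to the previous sibling's tail
            previous.tail = (previous.tail or '') + element.tail
        else:
            # No previous sibling: append the tail to the parent's text
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)

root = etree.fromstring('<a>Head<b>Text</b>Text2<c/>Text3</a>')
for bad in root.xpath('.//b | .//c'):
    remove_keeping_tail(bad)
result = etree.tostring(root, encoding='unicode')
# result == '<a>HeadText2Text3</a>'
```

The xpath result is a plain list built before the loop starts, so removing elements while iterating over it is safe here.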

HTH

Answered by phlou, Nov 02 '22 12:11