Example:
html = <a><b>Text</b>Text2</a>
BeautifulSoup code:
[x.extract() for x in html.findAll('b')]
The output is:
html = <a>Text2</a>
Lxml code:
[bad.getparent().remove(bad) for bad in html.xpath(".//b")]
The output is:
html = <a></a>
because lxml treats "Text2" as the tail of <b></b>.
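To make the tail behaviour visible, here is a minimal sketch that inspects the parsed tree before removing anything. It assumes the fragment is parsed with lxml.html's fragment_fromstring; the variable names are only illustrative.

```python
from lxml.html import fragment_fromstring, tostring

html = fragment_fromstring('<a><b>Text</b>Text2</a>')
b = html.find('b')

# "Text2" is not stored on <a>; it hangs off <b> as its tail.
print(repr(html.text))  # None
print(repr(b.text))     # 'Text'
print(repr(b.tail))     # 'Text2'

# remove() detaches the element together with its tail,
# so "Text2" disappears from the tree.
b.getparent().remove(b)
print(tostring(html, encoding="unicode"))  # <a></a>
```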
If we only need the joined text of the tags, we can blank the text of the unwanted elements instead:
for bad in raw.xpath(xpath_search):
    bad.text = ''
But how can we remove the tags themselves, without their tails, and leave the rest of the text unchanged?
While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.
If you want to remove a specific element, then the lxml method you are looking for is drop_tree (available on lxml.html elements).
From the docs:
Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
If you want to remove all instances of a specific tag, you can use lxml.etree.strip_elements (or lxml.html.etree.strip_elements) with with_tail=False.
From the docs:
Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.
So, for the example in the original post:
>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
... bad.drop_tree()
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
or
>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
Edit:
I did the following to save the tail text to the previous sibling or parent.
def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail:  # preserve the tail
        previous = node.getprevious()
        if previous is not None:  # if there is a previous sibling, it gets the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else:  # otherwise the parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
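The two methods above belong to a class that is not shown. As a sketch only, the same idea can be written as one standalone function that also works on plain lxml.etree elements, which do not have drop_tree. The function name is reused from above; the root-element check and the sample document are assumptions added for illustration.

```python
from lxml import etree

def remove_keeping_tail(element):
    """Remove element from its parent but keep its tail text in the tree."""
    parent = element.getparent()
    if parent is None:
        # Assumption: refuse to remove the root, which has no parent.
        raise ValueError("cannot remove the root element")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # A previous sibling exists: append the tail to its tail.
            previous.tail = (previous.tail or '') + element.tail
        else:
            # First child: append the tail to the parent's text.
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)

root = etree.fromstring('<a>Head<b>Text</b>Text2<c/><b>More</b>Text3</a>')
for bad in root.xpath('.//b'):
    remove_keeping_tail(bad)
result = etree.tostring(root, encoding="unicode")
print(result)  # <a>HeadText2<c/>Text3</a>
```

The first <b> has no previous sibling, so "Text2" joins the parent's text; the second <b> follows <c/>, so "Text3" becomes the tail of <c/>.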
HTH