There are a lot of questions here with similar titles, but I'm trying to remove the tag from the soup object itself.
I have a page that contains, among other things, this div:
<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>
I can select <div id="content"> with soup.find('div', id='content'), but I want to remove the <div id="blah"> from it.
Beautiful Soup also allows for the removal of tags from the document. This is accomplished using the decompose() and extract() methods.
To remove HTML tags from a string in Python using the lxml module: the lxml.html.fromstring() function takes the original string as input and returns the parsed element. From that element, we can extract the text using the text_content() method, leaving behind the HTML tags. The text_content() method returns the text of the element and all of its children as a string, with no markup.
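As a rough illustration of the lxml approach just described, the sketch below parses the question's markup and pulls out the text. The variable names are made up for the example, and it assumes lxml is installed. Note that this strips all tags and keeps the text of every child, so it does not by itself drop the unwanted div.

```python
import lxml.html

html = (
    '<div id="content">I want to keep this<br/>'
    '<div id="blah">I want to remove this</div></div>'
)

# fromstring() parses the markup and returns the root element.
element = lxml.html.fromstring(html)

# text_content() returns the text of the element and all of its
# descendants, with the tags themselves left out.
text = element.text_content()
print(text)
```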
For this, use the decompose() method, which is built into Beautiful Soup. Tag.decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
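A minimal sketch of decompose() applied to the markup from the question, assuming the unwanted tag can be found by its id. The parser choice ("html.parser") and the variable names are assumptions made for the example, not part of the original post.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="content">I want to keep this<br/>'
    '<div id="blah">I want to remove this</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Locate the unwanted tag; find() returns None when nothing matches.
blah = soup.find("div", id="blah")
if blah is not None:
    # Remove the tag from the tree and destroy it and its contents.
    blah.decompose()

content = soup.find("div", id="content")
remaining = str(content)
print(remaining)
```

Because only the div with id "blah" is targeted, the br tag and the surrounding text stay in place.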
You can use extract() if you want to remove a tag or string from the tree.
In [14]: soup = BeautifulSoup("""<div id="content">
....: I want to keep this<br /><div id="blah">I want to remove this</div>
....: </div>""")
In [15]: blah = soup.find(id='blah')
In [16]: _ = blah.extract()
In [17]: soup
Out[17]:
<html><body><div id="content">
I want to keep this<br/>
</div></body></html>
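One difference from decompose() is that extract() hands back the element it removed, so it can be inspected or reused. The sketch below shows this on a shortened version of the markup; the simplified HTML, the "html.parser" choice, and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="content">keep<div id="blah">remove</div></div>',
    "html.parser",
)

# extract() detaches the tag from the tree and returns it.
removed = soup.find(id="blah").extract()

removed_text = removed.get_text()  # the detached tag is still usable
remaining = str(soup)              # the tree no longer contains it

print(removed_text)
print(remaining)
```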
The Tag.decompose method removes a tag from the tree. So find the div tag:
div = soup.find('div', {'id':'content'})
Loop over all the children but the first:
for child in list(div)[1:]:
and try to decompose the children:
    try:
        child.decompose()
    except AttributeError: pass
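import bs4 as bs

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
# Name the parser explicitly so the result does not depend on what is installed.
soup = bs.BeautifulSoup(content, 'html.parser')
div = soup.find('div', {'id':'content'})
# Copy the children into a list first, since the loop modifies the tree.
for child in list(div)[1:]:
    try:
        child.decompose()
    except AttributeError:
        # Text nodes may not support decompose(); skip them.
        pass
print(div)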
yields
<div id="content">
I want to keep this
</div>
The equivalent using lxml would be
import lxml.html as LH
content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)
div = root.xpath('//div[@id="content"]')[0]
# Iterate over a copy of the children, since the loop modifies the tree.
for child in list(div):
    div.remove(child)
print(LH.tostring(div))
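Note that the loop above removes every child element of the div, including the br. If only the div with id "blah" should go, one possible variation is lxml.html's drop_tree(), which removes an element and its children but keeps any text that follows it. The sketch below assumes slightly modified markup (the trailing " and this" text is added purely to show that tail text survives).

```python
import lxml.html as LH

content = (
    '<div id="content">I want to keep this<br/>'
    '<div id="blah">I want to remove this</div> and this</div>'
)
root = LH.fromstring(content)

# Remove only the matching elements; drop_tree() keeps the text that
# follows the removed element by joining it to the previous sibling.
for unwanted in root.xpath('//div[@id="blah"]'):
    unwanted.drop_tree()

result = LH.tostring(root, encoding="unicode")
print(result)
```

Passing encoding="unicode" makes tostring() return a str rather than bytes, which prints more cleanly on Python 3.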