
Remove tag from text with BeautifulSoup

There are a lot of questions here with a similar title, but I'm trying to remove the tag from the soup object itself.

I have a page that contains, among other things, this div:

<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>

I can select <div id="content"> with soup.find('div', id='content') but I want to remove the <div id="blah"> from it.

asked Jul 16 '15 at 10:07 by Juicy



2 Answers

You can use extract() if you want to remove a tag or string from the tree. It detaches the element and returns it.

In [13]: from bs4 import BeautifulSoup

In [14]: soup = BeautifulSoup("""<div id="content">
   ....: I want to keep this<br /><div id="blah">I want to remove this</div>
   ....: </div>""", "lxml")

In [15]: blah = soup.find(id='blah')

In [16]: _ = blah.extract()

In [17]: soup
Out[17]: 
<html><body><div id="content">
I want to keep this<br/>
</div></body></html>
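
Since extract() hands back the element it removed, the removed div can still be inspected afterwards. The following is only a minimal sketch of that idea on the markup from the question; the variable names are illustrative, and "html.parser" is an assumed parser choice (so no <html><body> wrapper is added).

```python
from bs4 import BeautifulSoup

html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

# "html.parser" is an assumption; any installed parser should work here.
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# extract() detaches the tag from the tree and returns it.
removed = content.find("div", id="blah").extract()
removed_text = removed.get_text()

# The container no longer holds the removed div's text.
remaining_text = content.get_text(strip=True)

print(removed_text)
print(remaining_text)
```

If the removed element is not needed afterwards, decompose() may be the better fit, since it destroys the tag instead of returning it.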
answered Sep 29 '22 at 04:09 by styvane


The Tag.decompose method removes a tag from the tree and destroys it along with its contents. So find the div tag:

div = soup.find('div', {'id':'content'})

Loop over all the children but the first (the text to keep). Note that this also removes the <br /> tag:

for child in list(div)[1:]:

and try to decompose the children:

    try:
        child.decompose()
    except AttributeError: pass

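import bs4 as bs

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = bs.BeautifulSoup(content, 'html.parser')
div = soup.find('div', {'id': 'content'})
# Copy the children into a list so removal does not disturb the iteration
for child in list(div)[1:]:
    try:
        child.decompose()
    except AttributeError:
        # skip children that cannot be decomposed
        pass
print(div)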

yields

<div id="content">
I want to keep this
</div>
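
If the <br /> should stay, one alternative is to decompose only the unwanted div rather than every child after the first. This is a sketch under that assumption, not the approach used above; the variable names and the "html.parser" choice are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

# "html.parser" is an assumed parser choice.
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# Search only inside the selected container.
blah = content.find("div", id="blah")
if blah is not None:  # find() returns None when nothing matches
    blah.decompose()  # destroys the tag and everything inside it

result = str(content)
print(result)
```

The None check keeps the snippet from failing on pages where the inner div is missing.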

The equivalent using lxml would be

import lxml.html as LH

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
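root = LH.fromstring(content)

div = root.xpath('//div[@id="content"]')[0]
# Iterate over a copy so removing children does not disturb the loop
for child in list(div):
    # remove() also drops the child's tail text
    div.remove(child)
# encoding='unicode' returns str rather than bytes on Python 3
print(LH.tostring(div, encoding='unicode'))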
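
To remove only the inner div with lxml and leave the <br /> alone, lxml.html elements also provide drop_tree(), which removes an element together with its children but keeps its tail text. A minimal sketch on the same markup, assuming lxml is installed; the variable names are illustrative.

```python
import lxml.html as LH

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)

div = root.xpath('//div[@id="content"]')[0]

# Select only the unwanted div inside the container.
for blah in div.xpath('.//div[@id="blah"]'):
    # drop_tree() removes the element and its children; its tail text
    # is joined to the previous sibling or the parent.
    blah.drop_tree()

# encoding='unicode' makes tostring() return str instead of bytes.
result = LH.tostring(div, encoding='unicode')
print(result)
```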
answered Sep 29 '22 at 04:09 by unutbu