
Using Beautiful Soup to strip HTML tags from a string

Does anyone have sample code that shows how to use Python's Beautiful Soup to strip all HTML tags, except a few, from a string of text?

I want to strip all JavaScript and every HTML tag except:

<a></a>
<b></b>
<i></i>

Those tags should also be kept when they carry attributes, for example:

<a onclick=""></a>

Thanks for helping -- I couldn't find much on the internet for this purpose.

ensnare asked Dec 12 '10 20:12



1 Answer

import BeautifulSoup  # BeautifulSoup 3 module (Python 2); bs4 is its Python 3 successor

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

# Walk every node in the tree and print only the <a>, <b> and <i> tags
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)

yields

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you just want the text contents, you could change print(tag) to print(tag.string).
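
The loop above prints the matching tags; it does not return the original string with the other tags removed, which is what the question asks for. Here is one possible sketch of that. It assumes the newer bs4 package (`from bs4 import BeautifulSoup`) rather than the BeautifulSoup 3 module imported above. In bs4, `unwrap()` removes a tag but keeps its children, and `decompose()` removes a tag together with its contents. The names `strip_tags`, `ALLOWED_TAGS` and `DROP_WITH_CONTENT` are made up for this example, and dropping the contents of `script` and `style` is an assumption about what "strip all javascript" should mean.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}            # tags to keep, taken from the question
DROP_WITH_CONTENT = ["script", "style"]   # assumption: remove these along with their contents


def strip_tags(html):
    """Return html with every tag outside ALLOWED_TAGS removed and its text kept."""
    soup = BeautifulSoup(html, "html.parser")

    # First delete script/style elements entirely, contents included.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()

    # find_all(True) matches every remaining tag and returns a list,
    # so the tree can be modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()  # remove the tag itself, keep its children in place

    return str(soup)


print(strip_tags('<p>Hi <b>there</b><script>alert(1)</script></p>'))
# prints: Hi <b>there</b>
```

Attributes on the kept tags, such as `onclick=""` on an `<a>`, are left untouched by this sketch.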

If you want to remove an attribute like onclick="" from the a tag, you could do this:

# Inside the same for loop: drop the onclick attribute from <a> tags before printing
if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
    if tag.name == 'a':
        del tag['onclick']
    print(tag)
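
Deleting attributes one by one works when you know which ones to expect. If the goal is instead to keep only a known-safe set, an allow-list may be easier to maintain. The following is a rough sketch under the same bs4 assumption (not the BeautifulSoup 3 module used above); `strip_attrs` and `ALLOWED_ATTRS` are hypothetical names, and the choice to keep only `href` on `<a>` is just an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical allow-list: attribute names to keep for each tag name.
# Tags not listed here lose all of their attributes.
ALLOWED_ATTRS = {"a": {"href"}}


def strip_attrs(html):
    """Return html with every attribute outside ALLOWED_ATTRS removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        keep = ALLOWED_ATTRS.get(tag.name, set())
        # Build a new dict rather than deleting keys while iterating over tag.attrs.
        tag.attrs = {name: value for name, value in tag.attrs.items() if name in keep}
    return str(soup)


print(strip_attrs('<a href="/x" onclick="go()">one</a>'))
# prints: <a href="/x">one</a>
```

To keep `onclick` as well, as the question suggests, add it to the set for `"a"`.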

unutbu answered Oct 20 '22 11:10