Does anyone have sample code that shows how to use Python's Beautiful Soup to strip all HTML tags, except for a few, from a string of text?
I want to strip all JavaScript and every HTML tag except:
<a></a>
<b></b>
<i></i>
And also things like:
<a onclick=""></a>
Thanks for helping -- I couldn't find much on the internet for this purpose.
Beautiful Soup also allows for the removal of tags from the document. This is accomplished using the decompose() and extract() methods.
Beautiful Soup is a Python package that can parse broken HTML. lxml offers similar tolerance through its parser, which is based on libxml2.
Removing HTML tags from a string in Python using regular expressions: regular expressions are a common tool for processing text, and they can also be used to remove HTML tags from a string. For this, we can use the sub() function defined in the re module.
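As a rough illustration of the regex approach, here is a minimal sketch using re.sub() from the standard library. The pattern and the function name strip_tags_except are assumptions made for this example, not something from the original post. Note that this only deletes the tag markup itself: the text inside a script element and attributes such as onclick are left in place, and a regex can be fooled by unusual HTML, so treat it as a quick hack rather than a sanitizer.

```python
import re

# Tags to keep (taken from the question).
ALLOWED = ("a", "b", "i")

# Illustrative pattern (an assumption): matches an opening or closing tag
# whose name is NOT one of the allowed names.
TAG_RE = re.compile(r"</?(?!(?:%s)\b)[a-zA-Z][^>]*>" % "|".join(ALLOWED))


def strip_tags_except(html):
    """Delete every tag except the allowed ones, keeping the text between tags."""
    return TAG_RE.sub("", html)


print(strip_tags_except('<p id="x">This is <i>paragraph</i> <a onclick="">one</a>.</p>'))
```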
decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
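To make the difference between the two methods concrete, here is a small sketch. It assumes the bs4 package (Beautiful Soup 4) is installed and uses the built-in "html.parser"; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>Keep <b>this</b><script>alert(1)</script><span>moved</span></p>"
soup = BeautifulSoup(html, "html.parser")

# decompose() removes the tag from the tree and destroys it with its contents.
soup.script.decompose()

# extract() also removes the tag from the tree, but returns it so it can be reused.
removed = soup.span.extract()

print(soup)     # the remaining document
print(removed)  # the extracted <span> tag
```

Use decompose() when the removed markup is never needed again, and extract() when you want to keep or move it.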
import BeautifulSoup  # Beautiful Soup 3 module name

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
# Walk every node in the tree and print only the whitelisted tags.
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)
yields
<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>
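The code above is written against the Beautiful Soup 3 API. For reference, here is a sketch of the same loop for the newer bs4 package, assuming bs4 is installed and using the built-in "html.parser". In bs4, soup.descendants plays the role of recursiveChildGenerator(); the variable name kept is just for illustration.

```python
from bs4 import BeautifulSoup, Tag

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup(doc, "html.parser")

# Collect every whitelisted tag, in document order.
kept = [
    tag
    for tag in soup.descendants
    if isinstance(tag, Tag) and tag.name in ("a", "b", "i")
]

for tag in kept:
    print(tag)
```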
If you just want the text contents, you could change print(tag) to print(tag.string).
If you want to remove an attribute like onclick="" from the a tag, you could do this:
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        if tag.name == 'a':
            # Drop the inline event handler before printing.
            del tag['onclick']
        print(tag)
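The loops above only print the whitelisted tags. If the goal is to get back the original string with everything else stripped, one possible approach with bs4 is sketched below. It assumes bs4 is installed; the function name sanitize and the attribute whitelist ALLOWED_ATTRS (keeping only href on a tags) are hypothetical choices for this example. It relies on unwrap(), which replaces a tag with its contents, and decompose(), which deletes a tag together with its contents.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}             # tags to keep (from the question)
ALLOWED_ATTRS = {"a": {"href"}}            # hypothetical attribute whitelist
REMOVE_WITH_CONTENT = {"script", "style"}  # delete these along with their contents


def sanitize(html):
    """Return html with only the whitelisted tags and attributes left in place."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so the tree can be
    # modified while looping over it.
    for tag in soup.find_all(True):
        if tag.name in REMOVE_WITH_CONTENT:
            tag.decompose()   # drop the tag and everything inside it
        elif tag.name not in ALLOWED_TAGS:
            tag.unwrap()      # drop the tag but keep its children
        else:
            allowed = ALLOWED_ATTRS.get(tag.name, set())
            # Keep only whitelisted attributes (this removes onclick).
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}
    return str(soup)


print(sanitize('<p>Hi <b>there</b><script>alert(1)</script> <a href="/x" onclick="evil()">link</a></p>'))
```

This is a sketch, not a vetted sanitizer; for untrusted input, a dedicated HTML sanitizing library is the safer choice.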