I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>
, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags, how can I process the string to remove any other tags?
The HTML tags can be removed from a given string by using replaceAll() method of String class. We can remove the HTML tags from a given string by using a regular expression. After removing the HTML tags from a string, it will return a string as normal text.
Use lxml.html.clean
! It's VERY easy!
from lxml.html.clean import clean_html print clean_html(html)
Suppose the following html:
html = '''\ <html> <head> <script type="text/javascript" src="evil-site"></script> <link rel="alternate" type="text/rss" src="evil-rss"> <style> body {background-image: url(javascript:do_evil)}; div {color: expression(evil)}; </style> </head> <body onload="evil_function()"> <!-- I am interpreted for EVIL! --> <a href="javascript:evil_function()">a link</a> <a href="#" onclick="evil_function()">another link</a> <p onclick="evil_function()">a paragraph</p> <div style="display: none">secret EVIL!</div> <object> of EVIL! </object> <iframe src="evil-site"></iframe> <form action="evil-site"> Password: <input type="password" name="password"> </form> <blink>annoying EVIL!</blink> <a href="evil-site">spam spam SPAM!</a> <image src="evil!"> </body> </html>'''
The results...
<html> <body> <div> <style>/* deleted */</style> <a href="">a link</a> <a href="#">another link</a> <p>a paragraph</p> <div>secret EVIL!</div> of EVIL! Password: annoying EVIL! <a href="evil-site">spam spam SPAM!</a> <img src="evil!"> </div> </body> </html>
You can customize the elements you want to clean and whatnot.
Here's a simple solution using BeautifulSoup:
from bs4 import BeautifulSoup VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br'] def sanitize_html(value): soup = BeautifulSoup(value) for tag in soup.findAll(True): if tag.name not in VALID_TAGS: tag.hidden = True return soup.renderContents()
If you want to remove the contents of the invalid tags as well, substitute tag.extract()
for tag.hidden
.
You might also look into using lxml and Tidy.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With