I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it safely on a web page. I have a list of allowed tags. How can I process the string to remove all other tags?
HTML tags can be removed from a string with a regular expression, for example through the replaceAll() method of Java's String class. The result is plain text. Note that this strips every tag rather than keeping an allowed subset.
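For comparison, here is a rough Python sketch of the same regex idea, using re.sub in place of replaceAll(). The names TAG_RE and strip_tags are made up for this example. It assumes a "tag" is simply anything between < and >, which is naive: it keeps the text inside <script> elements and cannot keep an allowed subset of tags, so treat it as an illustration rather than a sanitizer.

```python
import re

# Hypothetical helper names; a "tag" here is "<", any run of non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Remove everything that looks like an HTML tag, keeping the text in between."""
    return TAG_RE.sub("", text)

print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```

The text inside a removed element stays behind, so strip_tags("<script>alert(1)</script>") yields alert(1).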
Use lxml.html.clean! It's VERY easy!

    # Note: in lxml 5.2 and later this module is shipped separately
    # as the lxml_html_clean package.
    from lxml.html.clean import clean_html
    print(clean_html(html))

Suppose the following html:

    html = '''\
    <html>
      <head>
        <script type="text/javascript" src="evil-site"></script>
        <link rel="alternate" type="text/rss" src="evil-rss">
        <style>
          body {background-image: url(javascript:do_evil)};
          div {color: expression(evil)};
        </style>
      </head>
      <body onload="evil_function()">
        <!-- I am interpreted for EVIL! -->
        <a href="javascript:evil_function()">a link</a>
        <a href="#" onclick="evil_function()">another link</a>
        <p onclick="evil_function()">a paragraph</p>
        <div style="display: none">secret EVIL!</div>
        <object> of EVIL! </object>
        <iframe src="evil-site"></iframe>
        <form action="evil-site">
          Password: <input type="password" name="password">
        </form>
        <blink>annoying EVIL!</blink>
        <a href="evil-site">spam spam SPAM!</a>
        <image src="evil!">
      </body>
    </html>'''

The results...

    <html>
      <body>
        <div>
          <style>/* deleted */</style>
          <a href="">a link</a>
          <a href="#">another link</a>
          <p>a paragraph</p>
          <div>secret EVIL!</div>
          of EVIL!
          Password:
          annoying EVIL!
          <a href="evil-site">spam spam SPAM!</a>
          <img src="evil!">
        </div>
      </body>
    </html>

You can customize the elements you want to clean and whatnot.
Here's a simple solution using BeautifulSoup:
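
    from bs4 import BeautifulSoup

    VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

    def sanitize_html(value):
        # Name the parser explicitly so the result does not depend on
        # which parsers happen to be installed.
        soup = BeautifulSoup(value, 'html.parser')

        for tag in soup.find_all(True):
            if tag.name not in VALID_TAGS:
                # Hide the tag itself but keep rendering its children.
                tag.hidden = True

        # decode_contents() returns str; renderContents() returns bytes in bs4.
        return soup.decode_contents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden = True. Note that this leaves the attributes of allowed tags untouched, so something like an onclick on a <p> would survive.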
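If a third-party parser is not an option, the same allowlist idea can be sketched with the standard library's html.parser. This is an illustration under stated assumptions, not a vetted sanitizer: the names AllowlistSanitizer, sanitize_html_stdlib and DROP_CONTENT are made up for this example, the tag list is copied from VALID_TAGS above, and the choice to drop all attributes and to discard the text inside <script> and <style> is an assumption about what "safe" should mean here.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"strong", "em", "p", "ul", "li", "br"}  # copied from VALID_TAGS above
DROP_CONTENT = {"script", "style"}  # assumption: text inside these is discarded too


class AllowlistSanitizer(HTMLParser):
    """Hypothetical helper: re-emits allowed tags only, without any attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes are deliberately dropped, so onclick, style etc. never survive.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and tag != "br" and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser; escape the text again.
            self.out.append(escape(data, quote=False))


def sanitize_html_stdlib(value):
    parser = AllowlistSanitizer()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


dirty = '<p onclick="evil()">Hi <b>there</b> <em>you</em></p><script>alert(1)</script>'
print(sanitize_html_stdlib(dirty))  # <p>Hi there <em>you</em></p>
```

Comments are dropped because the parser's default comment handler does nothing. The sketch does not balance tags, so a stray </p> in the input is passed through as written.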
You might also look into using lxml and Tidy.