<p>I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <code><script></code>, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags, how can I process the string to remove any other tags?</p>

<p>Use <code>lxml.html.clean</code>! It's VERY easy!</p> <pre class="prettyprint"><code>from lxml.html.clean import clean_html print clean_html(html) </code></pre> <p>Suppose the following html:</p> <pre class="prettyprint"><code>html = '''\ <html> <head> <script type="text/javascript" src="evil-site"></script> <link rel="alternate" type="text/rss" src="evil-rss"> <style> body {background-image: url(javascript:do_evil)}; div {color: expression(evil)}; </style> </head> <body onload="evil_function()">  <a href="javascript:evil_function()">a link</a> <a href="#" onclick="evil_function()">another link</a> <p onclick="evil_function()">a paragraph</p> <div style="display: none">secret EVIL!</div> <object> of EVIL! </object> <iframe src="evil-site"></iframe> <form action="evil-site"> Password: <input type="password" name="password"> </form> <blink>annoying EVIL!</blink> <a href="evil-site">spam spam SPAM!</a> <image src="evil!"> </body> </html>''' </code></pre> <p>The results...</p> <pre class="prettyprint"><code><html> <body> <div> <style>/* deleted */</style> <a href="">a link</a> <a href="#">another link</a> <p>a paragraph</p> <div>secret EVIL!</div> of EVIL! Password: annoying EVIL! <a href="evil-site">spam spam SPAM!</a> <img src="evil!"> </div> </body> </html> </code></pre> <p>You can customize the elements you want to clean and whatnot.</p>

Remove HTML tags not on an allowed list from a Python string

Tags:

python

html

I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags, how can I process the string to remove any other tags?

935

asked Mar 30 '09 23:03

Everett Toews

2 Answers

Use lxml.html.clean! It's VERY easy!

from lxml.html.clean import clean_html print clean_html(html)

Suppose the following html:

html = '''\ <html>  <head>    <script type="text/javascript" src="evil-site"></script>    <link rel="alternate" type="text/rss" src="evil-rss">    <style>      body {background-image: url(javascript:do_evil)};      div {color: expression(evil)};    </style>  </head>  <body onload="evil_function()">     <!-- I am interpreted for EVIL! -->    <a href="javascript:evil_function()">a link</a>    <a href="#" onclick="evil_function()">another link</a>    <p onclick="evil_function()">a paragraph</p>    <div style="display: none">secret EVIL!</div>    <object> of EVIL! </object>    <iframe src="evil-site"></iframe>    <form action="evil-site">      Password: <input type="password" name="password">    </form>    <blink>annoying EVIL!</blink>    <a href="evil-site">spam spam SPAM!</a>    <image src="evil!">  </body> </html>'''

The results...

<html>   <body>     <div>       <style>/* deleted */</style>       <a href="">a link</a>       <a href="#">another link</a>       <p>a paragraph</p>       <div>secret EVIL!</div>       of EVIL!       Password:       annoying EVIL!       <a href="evil-site">spam spam SPAM!</a>       <img src="evil!">     </div>   </body> </html>

You can customize the elements you want to clean and whatnot.

145

answered Sep 22 '22 09:09

nosklo

Here's a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup  VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']  def sanitize_html(value):      soup = BeautifulSoup(value)      for tag in soup.findAll(True):         if tag.name not in VALID_TAGS:             tag.hidden = True      return soup.renderContents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.

You might also look into using lxml and Tidy.

answered Sep 19 '22 09:09

bryan

Related questions
                            
                                "Cannot open include file: 'config-win.h': No such file or directory" while installing mysql-python
                            
                                Numpy: Creating a complex array from 2 real ones?
                            
                                Python: download files from google drive using url
                            
                                How to hide *pyc files in atom editor
                            
                                overlay a smaller image on a larger image python OpenCv
                            
                                How to make a Tkinter window jump to the front?
                            
                                In what order does os.walk iterates iterate? [duplicate]
                            
                                missing python bz2 module
                            
                                How to create user from django shell
                            
                                How to split strings into text and number?
                            
                                Python: download a file from an FTP server
                            
                                How to Fix Permissions on Home-brew on MacOS High Sierra
                            
                                Convert alphabet letters to number in Python
                            
                                Complete scan of dynamoDb with boto3
                            
                                Launch a shell command with in a python script, wait for the termination and return to the script
                            
                                Locale date formatting in Python
                            
                                Passing values in Python [duplicate]
                            
                                How to obtain image size using standard Python class (without using external library)?
                            
                                Can a lambda function call itself recursively in Python?
                            
                                Parallelize apply after pandas groupby

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With