Remove HTML tags not on an allowed list from a Python string

Tags: python, html

I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can safely render the result on a web page. Given a list of allowed tags, how can I process the string to remove every other tag?

Everett Toews asked Mar 30 '09 23:03



2 Answers

Use lxml.html.clean! It's VERY easy!

    # Since lxml 5.2 this module ships separately as the lxml_html_clean package.
    from lxml.html.clean import clean_html

    print(clean_html(html))

Suppose the following html:

    html = '''\
    <html>
     <head>
       <script type="text/javascript" src="evil-site"></script>
       <link rel="alternate" type="text/rss" src="evil-rss">
       <style>
         body {background-image: url(javascript:do_evil)};
         div {color: expression(evil)};
       </style>
     </head>
     <body onload="evil_function()">
        <!-- I am interpreted for EVIL! -->
       <a href="javascript:evil_function()">a link</a>
       <a href="#" onclick="evil_function()">another link</a>
       <p onclick="evil_function()">a paragraph</p>
       <div style="display: none">secret EVIL!</div>
       <object> of EVIL! </object>
       <iframe src="evil-site"></iframe>
       <form action="evil-site">
         Password: <input type="password" name="password">
       </form>
       <blink>annoying EVIL!</blink>
       <a href="evil-site">spam spam SPAM!</a>
       <image src="evil!">
     </body>
    </html>'''

The results...

    <html>
      <body>
        <div>
          <style>/* deleted */</style>
          <a href="">a link</a>
          <a href="#">another link</a>
          <p>a paragraph</p>
          <div>secret EVIL!</div>
          of EVIL!
          Password:
          annoying EVIL!
          <a href="evil-site">spam spam SPAM!</a>
          <img src="evil!">
        </div>
      </body>
    </html>

You can customize which elements and attributes get cleaned by creating your own Cleaner instance from the same module instead of using the default clean_html.
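As a rough sketch of that customization, the snippet below builds a Cleaner restricted to an allowed list of tags. The tag list, the clean_fragment helper, and the sample input are illustrative assumptions, not part of the original answer. The import falls back between lxml_html_clean and lxml.html.clean because the module location depends on the installed lxml version.

```python
# Sketch only: ALLOWED_TAGS and clean_fragment are hypothetical names.
try:
    # lxml 5.2+ ships the cleaner as a separate package.
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases bundle it.
    from lxml.html.clean import Cleaner

ALLOWED_TAGS = ["p", "em", "strong", "ul", "li", "br"]

# allow_tags cannot be combined with remove_unknown_tags=True,
# so the latter is switched off explicitly.
cleaner = Cleaner(allow_tags=ALLOWED_TAGS, remove_unknown_tags=False)


def clean_fragment(fragment):
    """Return the fragment with tags outside ALLOWED_TAGS stripped."""
    return cleaner.clean_html(fragment)


cleaned = clean_fragment("<p>Hello <script>alert(1)</script><b>world</b></p>")
print(cleaned)
```

With these settings, tags outside the list are dropped while their text is kept, and the Cleaner's default options still remove scripts together with their contents.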

nosklo answered Sep 22 '22 09:09


Here's a simple solution using BeautifulSoup:

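    from bs4 import BeautifulSoup

    VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

    def sanitize_html(value):
        # Name the parser explicitly so the result does not depend on
        # which parsers happen to be installed.
        soup = BeautifulSoup(value, 'html.parser')
        for tag in soup.find_all(True):
            if tag.name not in VALID_TAGS:
                # Hide the tag's own markup but keep rendering its contents.
                tag.hidden = True
        # decode_contents() returns a str; renderContents() returns bytes.
        return soup.decode_contents()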

If you want to remove the contents of the invalid tags as well, replace the line tag.hidden = True with tag.extract().
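To make the difference between the two behaviours concrete, here is a hedged variant of the function above. The drop_contents flag and the attribute stripping are my own additions for illustration. It also uses unwrap(), the documented Beautiful Soup method for removing a tag while keeping its children, in place of the hidden flag.

```python
from bs4 import BeautifulSoup

# Same allowed list as above, as a set for fast lookups.
ALLOWED_TAGS = {"strong", "em", "p", "ul", "li", "br"}


def sanitize_html(value, drop_contents=False):
    """Remove tags outside ALLOWED_TAGS (drop_contents is a hypothetical flag).

    drop_contents=False keeps the text inside a removed tag;
    drop_contents=True removes the tag together with everything in it.
    """
    soup = BeautifulSoup(value, "html.parser")
    for tag in soup.find_all(True):
        if tag.name in ALLOWED_TAGS:
            # Assumption: attributes such as onclick are unwanted too.
            tag.attrs = {}
        elif drop_contents:
            tag.extract()  # detach the tag and its whole subtree
        else:
            tag.unwrap()  # remove the tag, keep its children in place
    return soup.decode_contents()


sample = '<p onclick="x()">Hi <script>alert(1)</script><b>bold</b></p>'
kept = sanitize_html(sample)
dropped = sanitize_html(sample, drop_contents=True)
print(kept)
print(dropped)
```

Note that when contents are kept, the body of a <script> element survives as visible plain text, so dropping contents is probably the safer choice for tags like that.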

You might also look into using lxml and Tidy.

bryan answered Sep 19 '22 09:09