What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?
sanitize is a Python module for making sure various things (e.g. HTML) are safe to use. It was originally written by Mark Pilgrim and is distributed under the BSD license.
User input should always be treated as malicious before making it down into lower layers of your application. Always handle sanitizing input as soon as possible and should not for any reason be stored in your database before checking for malicious intent.
Input sanitization is a cybersecurity measure of checking, cleaning, and filtering data inputs from users, APIs, and web services of any unwanted characters and strings to prevent the injection of harmful codes into the system.
Django HTML Sanitizer provides a set of utilities to easily sanitize/escape/clean HTML inputs in django. This app is built on top of bleach, the excellent Python HTML sanitizer.
Here is a snippet that will remove all tags not on the white list, and all tag attributes not on the attribues whitelist (so you can't use onclick
).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:..."
, and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja	vascript:alert('hi')">
or <a href="ja vascript:alert('hi')">
, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re from urlparse import urljoin from BeautifulSoup import BeautifulSoup, Comment def sanitizeHtml(value, base_url=None): rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:')) rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:')) re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE) validTags = 'p i strong b u a h1 h2 h3 pre br img'.split() validAttrs = 'href src width height'.split() urlAttrs = 'href src'.split() # Attributes which should have a URL soup = BeautifulSoup(value) for comment in soup.findAll(text=lambda text: isinstance(text, Comment)): # Get rid of comments comment.extract() for tag in soup.findAll(True): if tag.name not in validTags: tag.hidden = True attrs = tag.attrs tag.attrs = [] for attr, val in attrs: if attr in validAttrs: val = re_scripts.sub('', val) # Remove scripts (vbs & js) if attr in urlAttrs: val = urljoin(base_url, val) # Calculate the absolute url tag.attrs.append((attr, val)) return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With