Is there an Open Source Python library for sanitizing HTML and removing all Javascript?

I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.

Is there a nice library in Python that will clean out all the event handler attributes, <script> elements and other Javascript cruft from HTML or a DOM tree?

I am intending to use Beautiful Soup to regularize the HTML to make sure it doesn't contain unclosed tags and such. But, as far as I can tell, it has no pre-packaged way to strip all Javascript.

If there is a nice library in some other language, that might also work, but I would really prefer Python.

I've done a bunch of Google searching and hunted around on PyPI, but haven't been able to find anything obvious.

Related

  • Sanitising user input using Python
Omnifarious asked Jan 23 '23 21:01

2 Answers

As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:

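from bs4 import BeautifulSoup  # Beautiful Soup 4; the original snippet used the older BeautifulSoup 3 module

soup = BeautifulSoup(html, "html.parser")
# Remove every <script> element; event-handler attributes such as onclick are left untouched
for script_elt in soup.find_all("script"):
    script_elt.extract()
html = str(soup)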

Ned Batchelder answered Jan 25 '23 11:01

A whitelist approach to allowed tags, attributes, and their values is the only reliable way. Take a look at Recipe 496942: Cross-site scripting (XSS) defense.
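
To make the whitelist idea concrete, here is a minimal sketch using only the standard library's html.parser. It is not the code from the recipe mentioned above. The names ALLOWED_TAGS, ALLOWED_ATTRS, ALLOWED_SCHEMES, WhitelistSanitizer, is_safe_url and sanitize are hypothetical, and the lists are illustrative assumptions you would tune for your own site. Anything not explicitly listed is dropped, and text is re-escaped on output.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration; adjust for your application.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
VOID_TAGS = {"br"}                  # tags that take no closing tag
DROP_CONTENT = {"script", "style"}  # text inside these is discarded too


def is_safe_url(url):
    # Drop whitespace and control characters, which browsers may ignore.
    cleaned = "".join(ch for ch in url if ch > " ")
    head, sep, _ = cleaned.partition(":")
    if not sep or any(c in head for c in "/?#"):
        return True  # no scheme present, so this is a relative URL
    return head.lower() in ALLOWED_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unlisted tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and anything unlisted
            value = value or ""
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript:, data:, and similar schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>') returns '<p>Hi <b>there</b></p>'. The sketch does not balance unclosed tags, so it would still need to be paired with something like Beautiful Soup for that part.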

What is wrong with existing markup languages, such as the one used on this very site?

jfs answered Jan 25 '23 10:01