I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string.
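The stripping step is roughly this (a sketch with illustrative whitelists, shown with the modern bs4 API for brevity rather than my actual Beautiful Soup 3 code):

from bs4 import BeautifulSoup

ALLOWED_TAGS = {'p', 'a', 'em', 'strong'}
ALLOWED_ATTRS = {'href', 'title'}

def sanitize(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()  # drop the tag itself but keep its children
        else:
            # keep only whitelisted attributes
            tag.attrs = {k: v for k, v in tag.attrs.items()
                         if k in ALLOWED_ATTRS}
    return str(soup)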
However...
>>> unicode(BeautifulSoup('text < text'))
u'text < text'
That doesn't look like valid HTML to me. And with my tag stripper, it opens the way to all sorts of nastiness:
>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
<script>
</script>
script>alert("xss")<
<script>
</script>
script>
The <script></script> pairs will be removed, and what remains is not only an XSS attack, but even valid HTML as well.
The obvious solution is to replace all < characters by &lt; that, after parsing, are found not to belong to a tag (and similarly for >, &, ' and "). But the Beautiful Soup documentation only mentions the parsing of entities, not the producing of them. Of course I can run a replace over all NavigableString nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
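For illustration, the manual approach would be something like this (a Beautiful Soup 3 sketch; note that cgi.escape already misses single quotes, and text nodes containing literal entities would get double-escaped — exactly the kind of gap I'm worried about):

import cgi
from BeautifulSoup import BeautifulSoup

def escape_text_nodes(soup):
    for node in soup.findAll(text=True):
        # escapes &, < and > (and " with quote=True), but not '
        node.replaceWith(cgi.escape(unicode(node), quote=True))
    return soup

print unicode(escape_text_nodes(BeautifulSoup('text < text')))
# text &lt; text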
Why doesn't Beautiful Soup escape < (and other magic characters) by default, and how do I make it do that?
N.B. I've also looked at lxml.html.clean. It seems to work on the basis of blacklisting, not whitelisting, so it doesn't seem very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (e.g. tabindex). Also, it gives an AssertionError on the input <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>. Not good.
Suggestions for other ways to clean HTML are also very welcome. I'm hardly the only person in the world trying to do this, yet there seems to be no standard solution.
Beautiful Soup is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but Beautiful Soup has superior support for encoding detection.
I know this is 3.5 years after your original question, but you can use the formatter='html' argument to prettify(), encode(), or decode() to produce well-formed HTML.
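For example (assuming Beautiful Soup 4, where even the default formatter='minimal' escapes <, > and & on output, while formatter='html' additionally substitutes named entities where they exist):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('text < text é', 'html.parser')
>>> print(soup.decode(formatter='minimal'))
text &lt; text é
>>> print(soup.decode(formatter='html'))
text &lt; text &eacute;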
The lxml.html.clean.Cleaner class does allow you to provide a tag whitelist with the allow_tags argument and to use the precomputed attribute whitelist from feedparser with the safe_attrs_only argument. And lxml definitely handles entities properly on serialization.
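A sketch of that configuration (the whitelists here are illustrative; note that allow_tags requires remove_unknown_tags to be disabled):

from lxml.html.clean import Cleaner

cleaner = Cleaner(
    allow_tags=['p', 'a', 'em', 'strong'],  # tag whitelist
    remove_unknown_tags=False,  # must be False when allow_tags is given
    safe_attrs_only=True,  # feedparser's precomputed attribute whitelist
)

dirty = '<p onclick="alert(1)">hello <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>world</p>'
print(cleaner.clean_html(dirty))
# expected: <p>hello world</p> -- the onclick attribute and the script element are gone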