I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string.
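The stripping step is roughly this (a sketch with illustrative whitelists, shown with the modern bs4 API for brevity rather than my actual Beautiful Soup 3 code):

from bs4 import BeautifulSoup

ALLOWED_TAGS = {'p', 'a', 'em', 'strong'}
ALLOWED_ATTRS = {'href', 'title'}

def sanitize(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()  # drop the tag itself but keep its children
        else:
            # keep only whitelisted attributes
            tag.attrs = {k: v for k, v in tag.attrs.items()
                         if k in ALLOWED_ATTRS}
    return str(soup)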
However...
>>> unicode(BeautifulSoup('text < text'))
u'text < text'
That doesn't look like valid HTML to me. And with my tag stripper, it opens the way to all sorts of nastiness:
>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
<script>
</script>
script>alert("xss")<
<script>
</script>
script>
The <script></script> pairs will be removed, and what remains is not only an XSS attack, but even valid HTML as well.
The obvious solution is to replace all < characters by &lt; that, after parsing, are found not to belong to a tag (and similarly for >, &, ' and "). But the Beautiful Soup documentation only mentions the parsing of entities, not the producing of them. Of course I can run a replace over all NavigableString nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
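For illustration, the manual approach would be something like this (a Beautiful Soup 3 sketch; note that cgi.escape already misses single quotes, and text nodes containing literal entities would get double-escaped — exactly the kind of gap I'm worried about):

import cgi
from BeautifulSoup import BeautifulSoup

def escape_text_nodes(soup):
    for node in soup.findAll(text=True):
        # escapes &, < and > (and " with quote=True), but not '
        node.replaceWith(cgi.escape(unicode(node), quote=True))
    return soup

print unicode(escape_text_nodes(BeautifulSoup('text < text')))
# text &lt; text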
Why doesn't Beautiful Soup escape < (and other magic characters) by default, and how do I make it do that?
N.B. I've also looked at lxml.html.clean. It seems to work on the basis of blacklisting, not whitelisting, so it doesn't seem very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (e.g. tabindex). Also, it gives an AssertionError on the input <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>. Not good.
Suggestions for other ways to clean HTML are also very welcome. I'm hardly the only person in the world trying to do this, yet there seems to be no standard solution.
Beautiful Soup is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but Beautiful Soup has superior support for encoding detection.
I know this is 3.5 years after your original question, but you can use the formatter='html' argument to prettify(), encode(), or decode() to produce well-formed HTML.
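For example (assuming Beautiful Soup 4, where even the default formatter='minimal' escapes <, > and & on output, while formatter='html' additionally substitutes named entities where they exist):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('text < text é', 'html.parser')
>>> print(soup.decode(formatter='minimal'))
text &lt; text é
>>> print(soup.decode(formatter='html'))
text &lt; text &eacute;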
The lxml.html.clean.Cleaner class does allow you to provide a tag whitelist with the allow_tags argument and to use the precomputed attribute whitelist from feedparser with the safe_attrs_only argument. And lxml definitely handles entities properly on serialization.
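A sketch of that configuration (the whitelists here are illustrative; note that allow_tags requires remove_unknown_tags to be disabled):

from lxml.html.clean import Cleaner

cleaner = Cleaner(
    allow_tags=['p', 'a', 'em', 'strong'],  # tag whitelist
    remove_unknown_tags=False,  # must be False when allow_tags is given
    safe_attrs_only=True,  # feedparser's precomputed attribute whitelist
)

dirty = '<p onclick="alert(1)">hello <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>world</p>'
print(cleaner.clean_html(dirty))
# expected: <p>hello world</p> -- the onclick attribute and the script element are gone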