I've got the common situation where I've got user input that uses a subset of HTML (input with tinyMCE). I need to have some server-side protection against XSS attacks and am looking for a well-tested tool that people are using to do this. On the PHP side I'm seeing lots of libraries like HTMLPurifier that do the job, but I can't seem to find anything in .NET.
I'm basically looking for a library to filter down to a whitelist of tags, attributes on those tags, and does the right thing with "difficult" attributes like a:href and img:src
I've seen Jeff Atwood's post at http://refactormycode.com/codes/333-sanitize-html, but I don't know how up-to-date it is. Does it have any bearing at all to what the site is currently using? And in any case I'm not sure I'm comfortable with that strategy of trying to regexp out valid input.
This blog post lays out what seems to be a much more compelling strategy:
http://blog.bvsoftware.com/post/2009/01/08/How-to-filter-Html-Input-to-Prevent-Cross-Site-Scripting-but-Still-Allow-Design.aspx
This method is to actually parse the HTML into a DOM, validate that, then rebuild valid HTML from it. If the HTML parsing can handle malformed HTML sensibly, then great. If not, no big deal -- I can demand well-formed HTML since the users should be using the tinyMCE editor. In either case I'm rewriting what I know is safe, well-formed HTML.
The problem is that's just a description, without a link to any library that actually executes that algorithm.
Does such a library exist? If not, what would be a good .NET HTML parsing engine? And what regular expressions should be used to perform extra validation a:href, img:src? Am I missing something else important here?
I don't want re-implement a buggy wheel here. Surely there's some commonly used libraries out there. Any ideas?
We are using the HtmlSanitizer .Net library, which:
Also on NuGet
Well if you want to parse, and you're worried about invalid (x)HTML coming in then the HTML Agility Pack is probably the best thing to use for parsing. Remember though it's not just elements, but also attributes on allowed elements you need to allow (of course you should work to an allowed whitelist of elements and their attributes, rather than try to strip things that might be dodgy via a blacklist)
There's also the OWASP AntiSamy Project which is an ongoing work in progress - they also have a test site you can try to XSS
Regex for this is probably too risky IMO.
Microsoft has an open-source library to protect against XSS: AntiXSS.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With