I'm looking for a good HTML sanitizer to use in an ASP.NET project. The catch is that the sanitizer must support style attributes, which may contain CSS properties, which must also be sanitized. So far I haven't been able to find a good product to use. Before I bite the bullet and write my own sanitizer, I thought I might try to see what people here are using first.
Libraries that I've looked at and rejected:
The ideal would be to have a whitelist-based sanitizer that also validates property values against a list of known values or regexes.
Anybody able to point me in the right direction?
Try this native .NET HTML Sanitizer project. It handles style attributes the way you want (though it doesn't try to preserve STYLE tags; it just removes them).
Additionally, it's whitelist-based rather than blacklist-based, and it uses AngleSharp instead of CsQuery, which is now deprecated. It's also on NuGet!
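As a rough illustration of whitelist-based style handling, here is a minimal sketch. It assumes the project referred to above is the HtmlSanitizer NuGet package (namespace Ganss.Xss in recent versions); the class name SanitizerDemo and the choice of allowed properties are made up for the example.

```csharp
using Ganss.Xss; // assumption: older package versions use the namespace Ganss.XSS

public static class SanitizerDemo
{
    // Returns sanitized HTML, keeping only two CSS properties in style attributes.
    public static string Clean(string html)
    {
        var sanitizer = new HtmlSanitizer();

        // Replace the default CSS property whitelist with a narrower one.
        sanitizer.AllowedCssProperties.Clear();
        sanitizer.AllowedCssProperties.Add("color");
        sanitizer.AllowedCssProperties.Add("font-weight");

        // Tags and attributes that are not on the whitelists are dropped.
        return sanitizer.Sanitize(html);
    }
}
```

The tag and attribute whitelists (AllowedTags, AllowedAttributes) can be narrowed in the same way if the defaults are too permissive for your content.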
Look at CsQuery (which I am the primary author of) as a tool for manipulating HTML.
This is a .NET jQuery port, it provides you with complete access to HTML via the same methods you would use on the client (a DOM and jQuery's API). This makes it pretty easy to roll your own sanitizer.
Rick Strahl had a blog post recently about sanitizing HTML. He showed how to do it with his rules using HTML Agility Pack, and I posted a comment there showing how to achieve the same thing more easily with CsQuery. The basics are just this, given an enumeration of tags BlackList:
CQ doc = CQ.Create(html);
// creates a grouped selector "iframe,form,script, ..."
string selector = String.Join(",",BlackList);
// CsQuery uses the property indexer as a default method, it's identical
// to the "Select" method and functions like $(...)
doc[selector].Remove();
If you don't want to actually remove content in some tags, e.g. perhaps formatting tags you wish to prohibit, you can use jQuery's unwrap instead. This would have the effect of removing a tag but preserving its children.
doc[selector].Unwrap();
When you're done:
string cleanHtml = doc.Render();
There's more in Rick's post about cleaning up JavaScript event attributes and so on, but basically CsQuery is a toolbox with a familiar and simple way to manipulate HTML. It should be easy enough to create a sanitizer that works in the way you want.
CsQuery's DOM model also contains methods to access the styles directly (e.g. in a more convenient way than just manipulating the string), if you need to do something like remove certain named styles. For example you could remove the "font-weight" style from all elements:
// use the [attribute] selector to target only elements with styles
foreach (IDomObject element in doc["[style]"]) {
    if (element.HasStyle("font-weight")) {
        element.RemoveStyle("font-weight");
    }
}
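The loop above removes one named style. For the whitelist approach the question asks about, where each property value is checked against a regex, something like the following sketch could be applied to the text of each style attribute. It uses only the .NET base class library; the class name StyleWhitelist and the three rules are hypothetical and would need to be extended for real content.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StyleWhitelist
{
    // Hypothetical rules: property name -> pattern the whole value must match.
    private static readonly Dictionary<string, Regex> Allowed =
        new Dictionary<string, Regex>(StringComparer.OrdinalIgnoreCase)
        {
            ["font-weight"] = new Regex(@"^(normal|bold|[1-9]00)$", RegexOptions.IgnoreCase),
            ["text-align"] = new Regex(@"^(left|right|center|justify)$", RegexOptions.IgnoreCase),
            ["color"] = new Regex(@"^(#[0-9a-f]{3}|#[0-9a-f]{6}|[a-z]+)$", RegexOptions.IgnoreCase),
        };

    // Keeps only declarations whose name is whitelisted and whose value matches its rule.
    public static string Clean(string style)
    {
        var kept = new List<string>();
        foreach (string declaration in (style ?? "").Split(';'))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue; // empty or malformed declaration

            string name = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();

            if (Allowed.TryGetValue(name, out Regex rule) && rule.IsMatch(value))
                kept.Add(name.ToLowerInvariant() + ": " + value);
        }
        return string.Join("; ", kept);
    }
}
```

For example, StyleWhitelist.Clean("color: red; font-weight: bold; background: url(javascript:alert(1))") returns "color: red; font-weight: bold". Splitting on semicolons is naive, but a value that itself contains a semicolon is broken into fragments that fail the rules, so the sketch errs on the side of dropping content rather than letting it through.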
The major shortcoming of CsQuery right now is documentation. Its API is designed to match the browser DOM and jQuery as closely as possible (given language differences between JavaScript and C#), and the public API is well commented, so it should be easy enough to code against once you get going.
But there are a handful of nonstandard methods (like "HasStyle" and "RemoveStyle") that are unique to CsQuery. Basic usage is covered pretty well in the readme on GitHub, though. It's also on NuGet as CsQuery.