
HTML Sanitizer for .NET that supports style tags

I'm looking for a good HTML sanitizer to use in an ASP.NET project. The catch is that the sanitizer must support style attributes, which may contain CSS properties, which must also be sanitized. So far I haven't been able to find a good product to use. Before I bite the bullet and write my own sanitizer, I thought I might try to see what people here are using first.

Libraries that I've looked at and rejected:

  • AntiXSS Library (old version is insecure, new version strips style tags)
  • AntiSamy .NET (unmaintained, lacks necessary features in the .NET version, has obsolete dependencies)
  • The HTMLAgilityPackSanitizer in AjaxControlToolkit (escapes style tags)

The ideal would be to have a whitelist-based sanitizer that also validates property values against a list of known values or regexes.

Anybody able to point me in the right direction?

Keith Ripley asked Aug 16 '12 02:08


2 Answers

Try this native .NET HTML Sanitizer project. It understands style attributes as you want (though it doesn't try to preserve STYLE tags; it just removes them).

Additionally, it's whitelist-based rather than blacklist-based, and it uses AngleSharp instead of CsQuery, which is now deprecated. It's also on NuGet!

pattermeister answered Oct 21 '22 01:10


Look at CsQuery (which I am the primary author of) as a tool for manipulating HTML.

This is a .NET jQuery port: it gives you complete access to HTML via the same methods you would use on the client (a DOM and jQuery's API). This makes it pretty easy to roll your own sanitizer.

Rick Strahl had a blog post recently about sanitizing HTML. He showed how to do it with his rules using HTML Agility Pack, and I posted a comment there showing how to achieve the same thing more easily with CsQuery. The basics are just this, given an enumeration of tags BlackList:

// requires: using CsQuery;
// html is the untrusted markup; BlackList is an enumeration of tag names
CQ doc = CQ.Create(html);

// creates a grouped selector "iframe,form,script, ..."
string selector = String.Join(",", BlackList);

// CsQuery uses the property indexer as a default method; it's identical
// to the "Select" method and functions like $(...)
doc[selector].Remove();

If you don't want to actually remove content in some tags, e.g. perhaps formatting tags you wish to prohibit, you can use jQuery's unwrap instead. This would have the effect of removing a tag but preserving its children.

doc[selector].Unwrap();

When you're done:

string cleanHtml = doc.Render();

There's more at Rick's post for cleaning up JavaScript event attributes and so on, but basically CsQuery is a toolbox with a familiar and simple way to manipulate HTML. It should be easy enough to create a sanitizer that works in the way you want.

CsQuery's DOM model also contains methods to access the styles directly (i.e., in a more convenient way than just manipulating the string), if you need to do something like remove certain named styles. For example, you could remove the "font-weight" style from all elements:

// use the [attribute] selector to target only elements with styles

foreach (IDomObject element in doc["[style]"]) {
    if (element.HasStyle("font-weight")) {
        element.RemoveStyle("font-weight");
    }
}
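
The question asks for the opposite direction: keep only known-good style properties and validate their values. That check could be layered on the same loop. What follows is only a minimal sketch using the base class library, not part of CsQuery; the StyleWhitelist class, its rule table, and the regexes are all hypothetical, and splitting on ';' is naive (it does not understand quoted strings or url(...) values), so treat it as a starting point rather than a finished sanitizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StyleWhitelist
{
    // Hypothetical rules: property name -> pattern the whole value must match.
    private static readonly Dictionary<string, Regex> Rules =
        new Dictionary<string, Regex>(StringComparer.OrdinalIgnoreCase)
        {
            ["color"] = new Regex(@"^(#[0-9a-f]{3}|#[0-9a-f]{6}|[a-z]+)$", RegexOptions.IgnoreCase),
            ["font-weight"] = new Regex(@"^(normal|bold|[1-9]00)$", RegexOptions.IgnoreCase),
            ["text-align"] = new Regex(@"^(left|right|center|justify)$", RegexOptions.IgnoreCase),
        };

    // Returns only the declarations whose property is listed in Rules
    // and whose value matches that property's pattern.
    public static string Clean(string style)
    {
        var kept = new List<string>();
        foreach (string declaration in (style ?? "").Split(';'))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue; // not a "name: value" pair

            string name = declaration.Substring(0, colon).Trim().ToLowerInvariant();
            string value = declaration.Substring(colon + 1).Trim();

            if (Rules.TryGetValue(name, out Regex rule) && rule.IsMatch(value))
                kept.Add(name + ": " + value);
        }
        return string.Join("; ", kept);
    }
}
```

With these rules, StyleWhitelist.Clean("color: red; position: fixed; font-weight: bold") returns "color: red; font-weight: bold". The cleaned string would then be written back to each element's style attribute, or the attribute removed when the result is empty.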

The major shortcoming of CsQuery right now is documentation. Its API is designed to match the browser DOM and jQuery as closely as possible (given language differences between jQuery and C#), and the public API is well commented, so it should be easy enough to code against once you get going.

But there are a handful of nonstandard methods (like "HasStyle" and "RemoveStyle") that are unique to CsQuery. Basic usage is covered pretty well in the readme on GitHub, though. It's also on NuGet as CsQuery.

Jamie Treworgy answered Oct 21 '22 02:10