My C# site allows users to submit HTML to be displayed on the site. I would like to limit the tags and attributes allowed in that HTML, but I am unable to figure out how to do this in .NET.
I've tried using Html Agility Pack, but I don't see how to modify the HTML. I can see how to walk through the HTML and find certain data, but actually generating an output file is baffling me.
Does anyone have a good example of cleaning up HTML in .NET? The Agility Pack might be the answer, but its documentation is lacking.
I would strongly recommend Microsoft's Anti-XSS Library for sanitizing input. It supports sanitizing HTML.
You should only accept well-formed HTML, that is, markup that also parses as XML.
You can then use LINQ to XML to parse and modify it.
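A minimal sketch of that parsing step, assuming the submitted fragment can be wrapped in a single root element. The HtmlInput class, the TryParseFragment name and the div wrapper are illustrative choices, not something the answer specifies. XElement.Parse throws XmlException on markup that is not well-formed XML, so the catch block is where such input gets rejected.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class HtmlInput {
    // Hypothetical helper: returns true and the parsed tree if the input is well-formed XML.
    public static bool TryParseFragment(string userHtml, out XElement root) {
        try {
            // Wrap in a div (an assumption) so input with several top-level nodes still has one root.
            root = XElement.Parse("<div>" + userHtml + "</div>");
            return true;
        } catch (XmlException) {
            // Unclosed tags, unquoted attributes and similar errors end up here.
            root = null;
            return false;
        }
    }
}
```

Note that HTML-only entities such as &nbsp; are not defined in XML, so input containing them is rejected by this check as well.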
You can make a recursive function that takes an element from the user and returns a new element with a whitelisted set of tags and attributes.
For example:
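using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

//Maps allowed tags to the attributes allowed on each tag.
static readonly Dictionary<string, string[]> AllowedTags = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
    { "b", new string[0] },
    { "img", new string[] { "src", "alt" } },
    //...
};

//Returns a cleaned copy of dirtyElement. The caller must first check that its own tag is in AllowedTags.
static XElement CleanElement(XElement dirtyElement) {
    string[] allowedAttributes = AllowedTags[dirtyElement.Name.LocalName];
    return new XElement(dirtyElement.Name,
        //Keep only the attributes whitelisted for this tag.
        dirtyElement.Attributes()
            .Where(a => allowedAttributes.Contains(a.Name.LocalName, StringComparer.OrdinalIgnoreCase))
            .Cast<object>()
            .Concat(
                //Keep text, recurse into whitelisted child elements, and drop everything else.
                dirtyElement.Nodes()
                    .Where(n => n is XText || (n is XElement e && AllowedTags.ContainsKey(e.Name.LocalName)))
                    .Select(n => (n is XElement e) ? (object)CleanElement(e) : n)
            )
    );
}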
If you allow hyperlinks, make sure to disallow javascript: URLs; this code doesn't do that.
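One way to sketch that check is to whitelist URL schemes rather than blacklist javascript:, so that other script-capable schemes are rejected too. The UrlPolicy class and IsSafeUrl name are hypothetical, and the choice of http and https as the only accepted schemes is an assumption. You would call it on the value of any href or src attribute before copying the attribute across.

```csharp
using System;

static class UrlPolicy {
    // Hypothetical helper: accepts only absolute http/https URLs.
    public static bool IsSafeUrl(string value) {
        Uri uri;
        // Uri.Scheme is returned in lower case, so "HTTP://..." also matches.
        return Uri.TryCreate(value, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

This sketch also rejects relative URLs, which may or may not be what you want; if your site needs them, they would have to be handled as a separate, explicitly allowed case.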
With HtmlAgilityPack you can remove unwanted tags from the input:
node.ParentNode.RemoveChild(node);
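A fuller sketch of the same idea, assuming the HtmlAgilityPack NuGet package is referenced. The HapSanitizer class, the Clean method and the Allowed table are illustrative names, with the table mirroring AllowedTags from the earlier answer. It also covers the part of the question about producing output: the cleaned markup is read back from DocumentNode.OuterHtml.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HapSanitizer {
    // Illustrative whitelist: allowed tag -> attributes allowed on it.
    static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
            { "b", new string[0] },
            { "img", new[] { "src", "alt" } },
        };

    public static string Clean(string html) {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // ToList() takes a snapshot so nodes can be removed while looping.
        foreach (var node in doc.DocumentNode.Descendants().ToList()) {
            if (node.NodeType == HtmlNodeType.Comment) { node.Remove(); continue; }
            if (node.NodeType != HtmlNodeType.Element) continue; // keep text nodes
            string[] allowedAttributes;
            if (!Allowed.TryGetValue(node.Name, out allowedAttributes)) {
                node.Remove(); // drops the tag together with its children
                continue;
            }
            foreach (var attribute in node.Attributes.ToList())
                if (!allowedAttributes.Contains(attribute.Name, StringComparer.OrdinalIgnoreCase))
                    attribute.Remove();
        }
        // Serialize the modified tree back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Removing a disallowed node here discards everything inside it, which is the conservative choice for tags like script. Like the LINQ to XML version, this sketch does not inspect attribute values, so URL attributes still need their own check.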