I been finding that for something that I consider pretty import there is very little information or libraries on how to deal with this problem.
I found this while searching. I really don't know all the million ways that a hacker could try to insert the dangerous tags.
I have a rich html editor so I need to keep non dangerous tags but strip out bad ones.
So is this script missing anything?
It uses html agility pack.
public string ScrubHTML(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
//Remove potentially harmful elements
HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.ParentNode.RemoveChild(node, false);
}
}
//remove hrefs to java/j/vbscript URLs
nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.SetAttributeValue("href", "#");
}
}
//remove img with refs to java/j/vbscript URLs
nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.SetAttributeValue("src", "#");
}
}
//remove on<Event> handlers from all tags
nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.Attributes.Remove("onFocus");
node.Attributes.Remove("onBlur");
node.Attributes.Remove("onClick");
node.Attributes.Remove("onMouseOver");
node.Attributes.Remove("onMouseOut");
node.Attributes.Remove("onDoubleClick");
node.Attributes.Remove("onLoad");
node.Attributes.Remove("onUnload");
}
}
// remove any style attributes that contain the word expression (IE evaluates this as script)
nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.Attributes.Remove("stYle");
}
}
return doc.DocumentNode.WriteTo();
}
Edit
2 people have suggested whitelisting. I actually like the idea of whitelisting but never actually did it because no one can actually tell me how to do it in C# and I can't even really find tutorials for how to do it in c#(the last time I looked. I will check it out again).
How do you make a white list? Is it just a list collection?
How do you actual parse out all html tags, script tags and every other tag?
Once you have the tags how do you determine which ones are allowed? Compare them to you list collection? But what happens if the content is coming in and has like 100 tags and you have 50 allowed. You got to compare each of those 100 tag by 50 allowed tags. Thats quite a bit to go through and could be slow.
Once you found a invalid tag how do you remove it? I don't really want to reject a whole set of text if one tag was found to be invalid. I rather remove and insert the rest.
Should I be using html agility pack?
That code is dangerous -- you should be whitelisting elements, not blacklisting them.
In other words, make a small list of tags and attributes you want to allow, and don't let any others through.
EDIT: I'm not familiar with HTML agility pack, but I see no reason why it wouldn't work for this. Since I don't know the framework, I'll give you pseudo-code for what you need to do.
doc.LoadHtml(html);
var validTags = new List<string>(new string[] {"b", "i", "u", "strong", "em"});
var nodes = doc.DocumentNode.SelectAllNodes();
foreach(HtmlNode node in nodes)
if(!validTags.Contains(node.Tag.ToLower()))
node.Parent.ReplaceNode(node, node.InnerHtml);
Basically, for each tag, if it's not contained in the whitelist, replace the tag with just its inner HTML. Again, I don't know your framework, so I can't really give you specifics, sorry. Hopefully this gets you started in the right direction.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With