Is there a C# class or third-party library that removes dangerous content, such as script tags, from HTML?
I know you can use a regex, but I also know that script tags can be written in so many ways that the regex can be fooled into thinking the input is OK.
I have also heard that HTML Agility Pack is good, so I am wondering whether a script-removal class has been written for it.
Edit
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=24346
I found this on their forums. However, I am not sure it is a complete solution: the author has no tests to back it up, and I would prefer something published on a site where lots of people use the script every day, so that anything slipping through gets noticed.
Great example (almost), Thanks! A few ways to make it stronger that I saw, though:
1) Use case-insensitive search when looking for links with "javascript:", "vbscript:", "jscript:". For example, the original example would not remove the HTML:
<a href="JAVAscRipt:alert('hi')">click me</a>
2) Remove any style attributes that contain an expression rule. Internet Explorer evaluates the CSS expression() rule as script. For example, the following would produce a message box:
<div style="width:expression(alert('hi'));">bad code</div>
3) Also remove tags
I honestly have no idea why "expression" has not been removed from IE; it is a major flaw in my opinion. (Try the div example in Internet Explorer and you'll see why, even in IE8.) I just wish there were an easier, standard way to clean up HTML input from a user.
Here's the code updated with these improvements. Let me know if you see anything wrong:
// Requires: using HtmlAgilityPack;
public string ScrubHTML(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Remove potentially harmful elements (each node is removed together with its children)
    HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.ParentNode.RemoveChild(node, false);
        }
    }

    // Neutralize hrefs that point to javascript/jscript/vbscript URLs (case-insensitive via translate)
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("href", "#");
        }
    }

    // Neutralize img src values that point to javascript/jscript/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("src", "#");
        }
    }

    // Remove on<Event> handlers from all tags.
    // Note: the double-click attribute is "ondblclick"; "ondoubleclick" is not a real event attribute.
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("onfocus");
            node.Attributes.Remove("onblur");
            node.Attributes.Remove("onclick");
            node.Attributes.Remove("onmouseover");
            node.Attributes.Remove("onmouseout");
            node.Attributes.Remove("ondblclick");
            node.Attributes.Remove("onload");
            node.Attributes.Remove("onunload");
        }
    }

    // Remove any style attribute that contains the word "expression" (IE evaluates this as script)
    nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("style");
        }
    }

    return doc.DocumentNode.WriteTo();
}
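Two gaps in the approach above are worth illustrating. ScrubHTML strips a fixed list of event handlers, so attributes such as onerror survive, and its starts-with test can be sidestepped by whitespace placed before or inside the scheme. The following is only a sketch of one way to tighten both, not a complete sanitizer. The class and method names (ScrubHelpers, IsScriptUrl, RemoveEventHandlers) are hypothetical, and it assumes a version of HTML Agility Pack that provides Descendants(), HasAttributes, and HtmlAttribute.Remove(). Entity-encoded schemes are not handled here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helpers; the names are illustrative and not part of HTML Agility Pack.
public static class ScrubHelpers
{
    // Scheme prefixes treated as script. Illustrative list, not exhaustive.
    private static readonly string[] ScriptSchemes = { "javascript:", "jscript:", "vbscript:" };

    // True when the URL, after dropping whitespace and control characters,
    // starts with one of the script schemes (compared case-insensitively).
    public static bool IsScriptUrl(string url)
    {
        if (string.IsNullOrEmpty(url)) return false;
        string compact = new string(
            url.Where(c => !char.IsWhiteSpace(c) && !char.IsControl(c)).ToArray());
        return ScriptSchemes.Any(s => compact.StartsWith(s, StringComparison.OrdinalIgnoreCase));
    }

    // Removes every attribute whose name begins with "on", whatever the event.
    public static void RemoveEventHandlers(HtmlDocument doc)
    {
        foreach (HtmlNode node in doc.DocumentNode.Descendants().ToList())
        {
            if (!node.HasAttributes) continue;
            // Copy the attributes first so removal does not disturb the enumeration.
            foreach (HtmlAttribute attr in node.Attributes.ToList())
            {
                if (attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase))
                {
                    attr.Remove();
                }
            }
        }
    }
}
```

IsScriptUrl could be applied to each href and src value in place of the XPath starts-with tests, and RemoveEventHandlers could replace the fixed handler list. Matching on the "on" prefix is deliberately broad: it may remove a harmless attribute, which seems a reasonable trade for user-supplied HTML.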
Select the HTML element that needs to be removed, then use the JavaScript remove() or removeChild() method to remove it from the HTML document.
We can remove a script from the DOM by scanning through all scripts on the page, getting the parent node of that script, and then finally removing the child of that parent node.
// $html is assumed to hold the markup to clean
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
// Collect the nodes first, because the node list is live and changes during removal
$remove = [];
foreach ($script as $item) {
    $remove[] = $item;
}
foreach ($remove as $item) {
    $item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
Definition and Usage. The <script> tag is used to embed a client-side script (JavaScript). The <script> element either contains scripting statements, or it points to an external script file through the src attribute. Common uses for JavaScript are image manipulation, form validation, and dynamic changes of content.
We had the same problem: Users enter HTML and we want to display it inside our XHTML pages. Note that they enter HTML fragments and not complete documents. I did research on this back in 2010 using unit tests to test for many different cases.
Solution:
This removes all JavaScript and produces output that is, in most cases, a valid XHTML fragment. It also removes all style tags.
The tools I tried have these problems:
Microsoft Anti-Cross Site Scripting Library: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order. Unfortunately not customizable.
Tidy.Net: Creates extra line breaks inside pre tags. (Can be fixed manually after running the tool.)
TidyForNet: Unstable. Sometimes fails with "Assertion failed in blabla.c".
Tidy (C-DLL) COM wrapper made in VB6: Impractical to say the least. You have to register the COM DLL.
HtmlAgilityPack: Inserts extra line breaks occasionally. Removes line breaks from pre tags.
Majestic12 HTML-parser: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order.
AntiSamy.Net: Impractical in that it uses components written in J#, which is obsolete. Because of this it cannot run in a 64-bit environment. On the plus side, it is very customizable regarding which tags and attribute values to allow.
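To make the idea of configuring "which tags and attribute values to allow" concrete, here is a rough allow-list sketch written against HTML Agility Pack. It is not AntiSamy's policy format or API. The class name, the policy table, and the rule that href and src must start with http:// or https:// are all assumptions chosen for illustration, and the line-break issues noted for HtmlAgilityPack above still apply to its output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical allow-list sanitizer; the policy below is an example only.
public static class AllowListSanitizer
{
    // Tag name -> attributes permitted on that tag.
    private static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "div", new string[0] },
            { "p", new string[0] },
            { "b", new string[0] },
            { "i", new string[0] },
            { "br", new string[0] },
            { "pre", new string[0] },
            { "a", new[] { "href" } },
            { "img", new[] { "src", "alt" } },
        };

    // Assumption: only absolute http(s) URLs are accepted in href/src.
    private static bool IsSafeUrl(string value)
    {
        string v = (value ?? string.Empty).Trim();
        return v.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || v.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the nodes so the tree can be modified while looping.
        foreach (HtmlNode node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.ParentNode == null) continue;

            if (node.NodeType == HtmlNodeType.Comment)
            {
                // Comments can carry IE conditional markup, so drop them.
                node.ParentNode.RemoveChild(node, false);
                continue;
            }
            if (node.NodeType != HtmlNodeType.Element) continue;

            string[] allowedAttrs;
            if (!Allowed.TryGetValue(node.Name, out allowedAttrs))
            {
                // Unknown tag: remove it together with its children.
                node.ParentNode.RemoveChild(node, false);
                continue;
            }

            foreach (HtmlAttribute attr in node.Attributes.ToList())
            {
                bool nameOk = allowedAttrs.Contains(attr.Name, StringComparer.OrdinalIgnoreCase);
                bool isUrl = attr.Name.Equals("href", StringComparison.OrdinalIgnoreCase)
                          || attr.Name.Equals("src", StringComparison.OrdinalIgnoreCase);
                if (!nameOk || (isUrl && !IsSafeUrl(attr.Value)))
                {
                    attr.Remove();
                }
            }
        }
        return doc.DocumentNode.WriteTo();
    }
}
```

Unlike a deny-list, this drops anything the policy does not name, so a new tag or event attribute is rejected by default rather than needing its own rule.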