
How to remove dangerous characters (i.e. script tags)?

I am wondering whether there is any C# class or third-party library that removes dangerous characters such as script tags.

I know you can use regex but I also know people can write their script tags so many ways that you can fool the regex into thinking it is OK.

I have also heard that HTML Agility Pack is good, so I am wondering whether a script-removal class has been written for it.

Edit

http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=24346

I found this on their forums. However, I am not sure it is a complete solution: the author has no tests to back it up, and it would be nicer if it were on a site where lots of people were using the script every day, so you could see whether anything gets by.

Great example (almost), Thanks! A few ways to make it stronger that I saw, though:

1) Use case-insensitive search when looking for links with "javascript:", "vbscript:", "jscript:". For example, the original example would not remove the HTML:

<a href="JAVAscRipt:alert('hi')">click me</a>

2) Remove any style attributes that contain an expression rule. Internet Explorer evaluates the CSS expression() rule as script. For example, the following would produce a message box:

<div style="width:expression(alert('hi'));">bad code</div>

3) Also remove tags

I honestly have no idea why "expression" has not been removed from IE; it is a major flaw in my opinion. (Try the div example in Internet Explorer and you'll see why, even in IE8.) I just wish there was an easier, standard way to clean up HTML input from a user.

Here's the code updated with these improvements. Let me know if you see anything wrong:

    public string ScrubHTML(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        //Remove potentially harmful elements
        HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.ParentNode.RemoveChild(node, false);

            }
        }

        //remove hrefs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {

            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("href", "#");
            }
        }


        //remove img with refs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("src", "#");
            }
        }

        //remove on<Event> handlers from all tags
        //(only the handlers listed here are covered; others, such as onerror, are not)
        //note: the double-click attribute is "ondblclick", not "ondoubleclick"
        nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("onFocus");
                node.Attributes.Remove("onBlur");
                node.Attributes.Remove("onClick");
                node.Attributes.Remove("onMouseOver");
                node.Attributes.Remove("onMouseOut");
                node.Attributes.Remove("onDblClick");
                node.Attributes.Remove("onLoad");
                node.Attributes.Remove("onUnload");
            }
        }

        // remove any style attributes that contain the word expression (IE evaluates this as script)
        nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("style");
            }
        }

        return doc.DocumentNode.WriteTo();
    } 
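
The XPath tests above match only the literal prefixes and keywords as typed. As a minimal sketch of stricter checks, assuming you read each attribute value out of the parsed document and test it in C# instead of in XPath, the helpers below normalize the value first. The class and method names (ScrubChecks, IsScriptUrl, HasCssExpression, IsEventHandlerAttribute) are hypothetical and are not part of HTML Agility Pack. The assumption behind the normalization is that some browsers have tolerated entity-encoded characters and stray whitespace in a URL scheme, and CSS comments inside the word "expression". Treat this as an illustration, not a complete filter.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; none of these names come from HTML Agility Pack.
public static class ScrubChecks
{
    private static readonly string[] ScriptSchemes = { "javascript:", "jscript:", "vbscript:" };

    // True if the URL starts with a script scheme once HTML entities are decoded
    // and whitespace/control characters are dropped.
    public static bool IsScriptUrl(string url)
    {
        if (string.IsNullOrEmpty(url)) return false;

        // "&#106;avascript:" becomes "javascript:"
        string decoded = WebUtility.HtmlDecode(url);

        // "java\tscript:" becomes "javascript:"
        string compact = new string(decoded
            .Where(c => !char.IsWhiteSpace(c) && !char.IsControl(c))
            .ToArray());

        return ScriptSchemes.Any(s => compact.StartsWith(s, StringComparison.OrdinalIgnoreCase));
    }

    // True if a style value contains "expression" after CSS comments are removed,
    // so "expr/**/ession(...)" is caught as well.
    public static bool HasCssExpression(string style)
    {
        if (string.IsNullOrEmpty(style)) return false;

        string withoutComments = Regex.Replace(style, @"/\*.*?\*/", "", RegexOptions.Singleline);
        return withoutComments.IndexOf("expression", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // True for any attribute name that starts with "on" (onclick, onerror, ondblclick, ...).
    // Deliberately broad: it also matches non-handler names that happen to start with "on".
    public static bool IsEventHandlerAttribute(string name)
    {
        return !string.IsNullOrEmpty(name)
            && name.StartsWith("on", StringComparison.OrdinalIgnoreCase);
    }
}
```

With helpers like these, the loops in ScrubHTML could walk every element and attribute and call the matching check, rather than listing each scheme and event name in an XPath string.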
chobo2 asked Jun 02 '10 22:06



1 Answer

We had the same problem: Users enter HTML and we want to display it inside our XHTML pages. Note that they enter HTML fragments and not complete documents. I did research on this back in 2010 using unit tests to test for many different cases.

Solution:

  1. Use Microsoft Anti-Cross Site Scripting Library to remove everything considered unsafe (mainly scripts). Note that this tool doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order.
  2. Use Tidy.Net to create almost-valid XHTML.
  3. Remove html, head and body tags that Tidy.Net tends to create.
  4. Remove extra line breaks that Tidy.Net creates inside "pre" tags.

This will remove all JS and produce what is, in most cases, a valid XHTML fragment. It will also remove all style tags.
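
Step 3 could be sketched as below. This is only an illustration under stated assumptions: the tidied output is assumed to wrap the fragment in a single body element, and the names FragmentHelper and ExtractBodyFragment are hypothetical. It uses a regular expression, which is reasonable here only because it runs on the already sanitized and tidied markup, not on raw user input. The calls into the Anti-XSS library and Tidy.Net themselves are left out because their exact APIs are not shown in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for step 3; not part of Tidy.Net or the Anti-XSS library.
public static class FragmentHelper
{
    // Returns the markup between <body ...> and </body>,
    // or the input unchanged when no body element is found.
    public static string ExtractBodyFragment(string xhtml)
    {
        if (string.IsNullOrEmpty(xhtml)) return xhtml;

        Match m = Regex.Match(
            xhtml,
            @"<body\b[^>]*>(.*)</body\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value.Trim() : xhtml;
    }
}
```

Taking the body content drops the html and head wrappers, along with anything inside head, in one pass.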

The tools I tried have these problems:

Microsoft Anti-Cross Site Scripting Library: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order. Unfortunately not customizable.

Tidy.Net: Creates extra line breaks inside pre tags. (Can be fixed manually after running the tool.)

TidyForNet: Unstable. Sometimes gives you "Assertion failed in blabla.c".

Tidy (C-DLL) COM wrapper made in VB6: Impractical to say the least. You have to register the COM DLL.

HtmlAgilityPack: Inserts extra line breaks occasionally. Removes line breaks from pre tags.

Majestic12 HTML-parser: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order.

AntiSamy.Net: Impractical in that it uses components written in J#, which is obsolete, so it cannot run in a 64-bit environment. On the plus side, it is very customizable regarding which tags and attribute values to allow.

Martin Ørding-Thomsen answered Nov 02 '22 00:11