
Removing all script tags from html with JS Regular Expression

I want to strip script tags out of this HTML at Pastebin:

http://pastebin.com/mdxygM0a

I tried the following regular expression:

html.replace(/<script.*>.*<\/script>/ims, " ")

But it does not remove all of the script tags in the HTML; it only removes inline scripts. I'm looking for a regex that removes all script tags, both inline and multi-line. I would appreciate it if any answer were tested against my sample: http://pastebin.com/mdxygM0a

Kennedy asked Jul 12 '11 04:07


4 Answers

jQuery uses a regex to remove script tags in some cases and I'm pretty sure its devs had a damn good reason to do so. Probably some browser does execute scripts when inserting them using innerHTML.

Here's the regex:

/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi

And before people start crying "but regexes for HTML are evil": Yes, they are - but for script tags they are safe because of the special behaviour - a <script> section may not contain </script> at all unless it should end at this position. So matching it with a regex is easily possible. However, from a quick look the regex above does not account for trailing whitespace inside the closing tag so you'd have to test if </script    etc. will still work.
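As a rough sketch of how the pattern is applied, and of the trailing-whitespace gap mentioned above: `SCRIPT_RE_LOOSE` below is a hypothetical variant (an assumption, not jQuery's code) that also accepts whitespace or stray characters between `</script` and `>`. The sample strings are made up for illustration.

```javascript
// The pattern quoted above, applied with String.prototype.replace.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical variant (an assumption, not jQuery's code): treat "</script"
// followed by anything up to the next ">" as the closing tag.
const SCRIPT_RE_LOOSE = /<script\b[^<]*(?:(?!<\/script\b)<[^<]*)*<\/script\b[^>]*>/gi;

// A multi-line script that contains a bare "<" in its body.
const multiLine = 'a<script type="text/javascript">\nif (x < 1) { go(); }\n</script>b';
// A closing tag with a space before ">".
const spacedClose = "a<script>go();</script >b";

console.log(multiLine.replace(SCRIPT_RE, ""));         // "ab"
console.log(spacedClose.replace(SCRIPT_RE, ""));       // unchanged: "</script >" is not matched
console.log(spacedClose.replace(SCRIPT_RE_LOOSE, "")); // "ab"
```

Whether the looser closing-tag match is wanted depends on the input; it is shown only to make the whitespace case concrete.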

ThiefMaster answered Oct 18 '22 05:10


Attempting to remove HTML markup using a regular expression is problematic, because you can't know what the script content or attribute values will contain. One alternative is to insert the markup as the innerHTML of a div, remove any script elements, and return the resulting innerHTML, e.g.

  function stripScripts(s) {
    var div = document.createElement('div');
    div.innerHTML = s;
    var scripts = div.getElementsByTagName('script');
    var i = scripts.length;
    while (i--) {
      scripts[i].parentNode.removeChild(scripts[i]);
    }
    return div.innerHTML;
  }

alert(
 stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
);

Note that at present, browsers will not execute the script if inserted using the innerHTML property, and likely never will especially as the element is not added to the document.

RobG answered Oct 18 '22 05:10


Regexes are beatable, but if you have a string version of HTML that you don't want to inject into a DOM, they may be the best approach. You may want to put it in a loop to handle something like:

<scr<script>Ha!</script>ipt> alert(document.cookie);</script>

Here's what I did, using the jQuery regex from above:

var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
while (SCRIPT_REGEX.test(text)) {
    text = text.replace(SCRIPT_REGEX, "");
}
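A slightly different way to write the same loop, as a minimal sketch: keep replacing until a pass changes nothing. This sidesteps the `lastIndex` state that `test()` keeps on a global regex. The helper name `stripScriptsUntilStable` is hypothetical.

```javascript
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: strip repeatedly until a pass leaves the text unchanged.
function stripScriptsUntilStable(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(SCRIPT_REGEX, "");
  } while (text !== previous);
  return text;
}

const nested = "<scr<script>Ha!</script>ipt> alert(document.cookie);</script>";

// A single pass removes the inner tag and thereby assembles a new outer one.
console.log(nested.replace(SCRIPT_REGEX, "")); // "<script> alert(document.cookie);</script>"
// Repeating until stable removes that one as well.
console.log(stripScriptsUntilStable(nested));  // ""
```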

Conrad Damon answered Oct 18 '22 07:10


This regex should work too:

<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>

It even allows "problematic" strings and comments like these inside:

<script type="text/javascript">
   var test1 = "</script>";
   var test2 = '\'</script>';
   var test1 = "\"</script>";
   var test1 = "<script>\"";
   var test2 = '<scr\'ipt>';
   /* </script> */
   // </script>
   /* ' */
   // var foo=" '
</script>

It seems that the jQuery and Prototype regexes fail on these.

Edit July 31 '17: Added a) non-capturing groups for better performance (and no empty groups) and b) support for JavaScript comments.
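A minimal sketch of running this pattern over the sample above. The `gi` flags, the name `STRING_AWARE_RE`, and the surrounding `<p>` elements are assumptions added for illustration. Keep in mind that an HTML parser generally ends a script element at the first `</script>`, whatever the JavaScript string or comment context, so this pattern follows JavaScript tokenization rather than browser parsing.

```javascript
// The pattern from this answer, wrapped as a regex literal (flags are an assumption).
const STRING_AWARE_RE = /<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>/gi;

// String.raw keeps the backslashes exactly as they appear in the sample.
const sample = String.raw`<p>before</p><script type="text/javascript">
   var test1 = "</script>";
   var test2 = '\'</script>';
   var test1 = "\"</script>";
   var test1 = "<script>\"";
   var test2 = '<scr\'ipt>';
   /* </script> */
   // </script>
   /* ' */
   // var foo=" '
</script><p>after</p>`;

// The whole block, including the quoted and commented "</script>" text, is removed.
console.log(sample.replace(STRING_AWARE_RE, "")); // "<p>before</p><p>after</p>"
```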

spaark answered Oct 18 '22 05:10