
remove script tag from HTML content

I am using HTML Purifier (http://htmlpurifier.org/)

I want to remove only <script> tags. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

I-M-JM asked Aug 20 '11 09:08


2 Answers

Because this question is tagged with regex, I'm going to answer with the poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html); 

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.
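
To make the "it will break eventually" point concrete, here is a minimal sketch that runs the pattern from above on two inputs. The sample strings and variable names are made up for illustration. The second input has whitespace before the ">" of the closing tag, which browsers still treat as a valid end tag, yet the pattern no longer matches and the script survives.

```php
<?php
// The pattern from the answer above.
$pattern = '#<script(.*?)>(.*?)</script>#is';

// Case 1: a plain, well-formed script element is removed as expected.
$simple = '<p>a</p><script>alert(1)</script><p>b</p>';
$cleanSimple = preg_replace($pattern, '', $simple);

// Case 2: whitespace inside the closing tag ("</script >").
// The literal "</script>" in the pattern never matches, so nothing is removed.
$tricky = '<p>a</p><script>alert(1)</script ><p>b</p>';
$cleanTricky = preg_replace($pattern, '', $tricky);

var_dump($cleanSimple === '<p>a</p><p>b</p>'); // bool(true): script removed
var_dump($cleanTricky === $tricky);            // bool(true): input unchanged, script still present
```

Loosening the pattern to cover this case only moves the problem to the next variation, which is the reason to prefer a real parser for untrusted input.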

A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

// Collect the nodes first: the node list is live, so removing
// nodes while iterating over it directly would skip elements.
$remove = [];
foreach ($script as $item) {
    $remove[] = $item;
}

foreach ($remove as $item) {
    $item->parentNode->removeChild($item);
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
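
For reuse, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name `strip_script_tags` and the sample input are hypothetical, the type declarations need PHP 7 or later, and the libxml error handling is an addition for convenience rather than something the snippet above requires.

```php
<?php
// Hypothetical helper: returns $html with every <script> element removed.
function strip_script_tags(string $html): string
{
    // Collect parser warnings internally instead of emitting them
    // (libxml warns about markup it does not recognise, e.g. HTML5 elements).
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live node list into a plain array before removing anything.
    $scripts = iterator_to_array($dom->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    return $dom->saveHTML();
}

// Inline formatting is kept; both script elements are dropped.
$input  = '<p style="color:red">Hi <b>there</b></p>'
        . '<script>alert(1)</script><script src="a.js"></script>';
$output = strip_script_tags($input);

echo $output;
```

Note that when `loadHTML()` is given a fragment like this, `saveHTML()` returns a complete document, so the result is wrapped in a doctype plus `<html>` and `<body>` elements.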

Dejan Marjanović answered Oct 18 '22 04:10


Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

$length = $script_tags->length;

// for each tag, remove it from the DOM; iterate backwards because the
// node list is live and shrinks every time a node is removed
for ($i = $length - 1; $i >= 0; $i--) {
    $node = $script_tags->item($i);
    $node->parentNode->removeChild($node);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.
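
One detail matters once a document contains more than one script element: the list returned by `getElementsByTagName()` is live, so it shrinks as nodes are removed from the document. The following minimal sketch, with a made-up three-script document and illustrative variable names, shows the length changing and one safe way to empty the list.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><head><script>a()</script><script>b()</script></head>'
    . '<body><p>x</p><script>c()</script></body></html>'
);

$scripts = $doc->getElementsByTagName('script');
$before  = $scripts->length;   // 3 script elements at this point

// Remove only the first script element.
$first = $scripts->item(0);
$first->parentNode->removeChild($first);
$after = $scripts->length;     // the same list object now reports 2

// Because the list shrinks, repeatedly removing item(0) until it is empty is safe.
while ($scripts->length > 0) {
    $node = $scripts->item(0);
    $node->parentNode->removeChild($node);
}
$remaining = $scripts->length; // 0

var_dump($before, $after, $remaining);
```

A forward loop that counts up to the original length would ask for `item($i)` positions that no longer exist, which is why removal loops either go backwards, always take `item(0)`, or copy the nodes into an array first.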

Alex answered Oct 18 '22 05:10