
Remove everything within script and style tags

I have a variable named $articleText and it contains html code. There are script and style codes within <script> and <style> html elements. I want to scan the $articleText and remove these pieces of code. If I can also remove the actual html elements <script>, </script>, <style> and </style>, I would do that too.

I imagine I need to use regex; however, I am not skilled in it.

Can anyone assist?

I wish I could provide some code but like I said I am not skilled in regex so I don't have anything to show.

I cannot use DOM. I specifically need to use regex against these tags.

somejkuser asked Nov 19 '13 21:11


2 Answers

Do not use regex on HTML. PHP provides a tool for parsing DOM structures, appropriately called DOMDocument.

<?php
// some HTML for example
$myHtml = '<html><head><script>alert("hi mom!");</script></head><body><style>body { color: red;} </style><h1>This is some content</h1><p>content is awesome</p></body><script src="someFile.js"></script></html>';

// create a new DomDocument object
$doc = new DOMDocument();

// load the HTML into the DomDocument object (this would be your source HTML)
$doc->loadHTML($myHtml);

removeElementsByTagName('script', $doc);
removeElementsByTagName('style', $doc);
removeElementsByTagName('link', $doc);

// output cleaned html
echo $doc->saveHtml();

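// Remove every element with the given tag name from the document.
function removeElementsByTagName($tagName, $document) {
  $nodeList = $document->getElementsByTagName($tagName);
  // The node list is live, so iterate backwards: removing a node
  // then never shifts the indexes of the nodes still to be visited.
  for ($nodeIdx = $nodeList->length; --$nodeIdx >= 0; ) {
    $node = $nodeList->item($nodeIdx);
    $node->parentNode->removeChild($node);
  }
}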

You can try it here: https://eval.in/private/4f225fa0dcb4eb

Documentation

  • DomDocument - http://php.net/manual/en/class.domdocument.php
  • DomNodeList - http://php.net/manual/en/class.domnodelist.php
  • DomDocument::getElementsByTagName - http://us3.php.net/manual/en/domdocument.getelementsbytagname.php
Chris Baker answered Oct 08 '22 07:10


Even though regex is not a good tool for this kind of task, it may work for a small, simple one.


If you want to remove just the inner text of the tag(s), use:

$txt = preg_replace('/(<(script|style)\b[^>]*>).*?(<\/\2>)/is', "$1$3", $txt);

See demo here.

If you also want to remove the tags themselves, make the replacement string in the code above empty, i.e. just "".
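For reference, here is a minimal sketch that wraps both variants of the pattern from this answer in one function. The name stripScriptAndStyle and the $keepTags flag are made up for illustration, and the sample $articleText is invented. Like any regex approach, it assumes reasonably well-formed markup: it will cut short at a literal "</script>" inside a JavaScript string, and it ignores blocks that are never closed.

```php
<?php
// Hypothetical helper (not part of the original answer).
// $keepTags = true  -> keep the empty <script></script> / <style></style> pair
// $keepTags = false -> remove the tags together with their contents
function stripScriptAndStyle(string $html, bool $keepTags = false): string
{
    // \2 makes the closing tag match the opening one;
    // "i" ignores case, "s" lets "." match newlines.
    $pattern = '/(<(script|style)\b[^>]*>).*?(<\/\2>)/is';
    $replacement = $keepTags ? '$1$3' : '';

    $result = preg_replace($pattern, $replacement, $html);

    // preg_replace() returns null on failure (e.g. backtrack limit reached).
    if ($result === null) {
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }

    return $result;
}

// Invented sample input standing in for the asker's $articleText.
$articleText = '<p>Hi</p><script>alert(1);</script>'
    . '<STYLE type="text/css">p { color: red; }</STYLE><p>Bye</p>';

echo stripScriptAndStyle($articleText), "\n";
// <p>Hi</p><p>Bye</p>

echo stripScriptAndStyle($articleText, true), "\n";
// <p>Hi</p><script></script><STYLE type="text/css"></STYLE><p>Bye</p>
```

Throwing on a null result, rather than returning the input unchanged, avoids silently passing unfiltered HTML through when the regex engine gives up on a very large block.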

Ωmega answered Oct 08 '22 06:10