strip_tags disallow some tags

Tags:

html

php

strip-tags

Based on the strip_tags documentation, the second parameter takes the allowable tags. However, in my case I want to do the reverse. Say I'll accept the tags that strip_tags normally (by default) accepts, but strip only the <script> tag. Is there any possible way to do this?

I'm not asking for somebody to code it for me; rather, any input on possible ways to achieve this (if it is possible) would be greatly appreciated.


asked Sep 11 '12 03:09

Leandro Garcia

2 Answers

EDIT

To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:

require_once '/path/to/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

http://htmlpurifier.org/docs

HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:

array('script','style','applet')

Or:

array('<script>','<style>','<applet>')

Or... Something else?

I think it's the first form, without delimiters; HTML.AllowedElements uses a configuration string format similar to TinyMCE's valid_elements syntax:

tinyMCE.init({
    ...
    valid_elements : "a[href|target=_blank],strong/b,div[align],br",
    ...
});

So my guess is that it's just the element name, and no attributes should be provided (since you're banning the element... although there is an HTML.ForbiddenAttributes directive, too). But that's a guess.

I'll add this note from the HTML.ForbiddenAttributes docs, as well:

Warning: This directive complements %HTML.ForbiddenElements, accordingly, check out that directive for a discussion of why you should think twice before using this directive.

Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.

Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)

Although I think you really should use HTML Purifier and its HTML.ForbiddenElements configuration directive, I think a reasonable alternative, if you really, really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.

For instance:

function blacklistElements($blacklisted = '', &$errors = array()) {
    if ((string)$blacklisted == '') {
        $errors[] = 'Empty string.';
        return array();
    }

    $html5 = array(
        "<menu>","<command>","<summary>","<details>","<meter>","<progress>",
        "<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
        "<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
        "<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
        "<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
        "<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
        "<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
        "<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
        "<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
        "<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
        "<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
        "<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
        "<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
        "<title>","<head>","<html>"
    );

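    // Normalize input such as '<html> <EM> li' into array('<html>', '<em>', '<li>').
    $list = trim(strtolower($blacklisted));
    // Keep only letters and spaces. Note that this also drops digits, so "h1" would become "h".
    $list = preg_replace('/[^a-z ]/i', '', $list);
    $list = '<' . str_replace(' ', '> <', $list) . '>';
    $list = array_map('trim', explode(' ', $list));

    // Whatever remains of the known element list is the whitelist.
    return array_diff($html5, $list);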
}

Then run it:

$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array();
// Pass $errors explicitly; the function fills it by reference.
$whitelist = blacklistElements($blacklisted, $errors);

if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}

http://codepad.org/LV8ckRjd

So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:

$stripped = strip_tags($html, implode('', $whitelist));
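One thing worth checking before relying on this: strip_tags() removes the tags themselves, not the text between them. The following is a minimal sketch with a hand-written, hypothetical three-element whitelist standing in for the array that blacklistElements() would return; the sample markup is made up for illustration.

```php
<?php
// Hypothetical whitelist, standing in for the result of blacklistElements().
$whitelist = array('<p>', '<em>', '<strong>');

// strip_tags() expects a single string with no whitespace, e.g. "<p><em><strong>".
$allowed = implode('', $whitelist);

// Made-up sample input containing a <script> element.
$html = '<p>Hello <em>world</em><script>alert(1)</script></p>';

// The <script> and </script> tags are removed, but the text inside them stays.
$stripped = strip_tags($html, $allowed);

echo $stripped, "\n";
```

So the script can no longer execute as markup, but its source text ends up in the output. If the content between the tags has to go as well, strip_tags() alone is not enough.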

Caveat Emptor

Now, I've kind of hacked this together, and I know there are some issues I haven't thought through yet. For instance, from the strip_tags() man page for the $allowable_tags argument:

Note:

This parameter should not contain whitespace. strip_tags() sees a tag as a case-insensitive string between < and the first whitespace or >. It means that strip_tags("<br/>", "<br>") returns an empty string.

It's late and for some reason I can't quite figure out what this means for this approach, so I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice that all of the tags are in this form:

<tagName>

I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag form <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.

So it's probably not production ready. But you get the idea.


answered Sep 29 '22 09:09

Jared Farrish

First, see what others have said on this topic:

Strip <script> tags and everything in between with PHP?

and

remove script tag from HTML content

It seems you have two choices. The first is a regex solution; both of the links above give one. The second is to use HTML Purifier.

If you are stripping the script tag for some reason other than sanitizing user content, the regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.
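For reference, the regex route usually looks something like the sketch below. The function name removeScriptBlocks is hypothetical, the pattern is one common variant rather than the exact one from the linked questions, and it should not be treated as a sanitizer for untrusted input.

```php
<?php
// Hypothetical helper: removes <script>...</script> blocks, including their content.
function removeScriptBlocks($html)
{
    // \b      : match "<script" but not "<scripts"
    // [^>]*   : skip any attributes on the opening tag
    // .*?     : lazily match the script body; the "s" flag lets "." span newlines
    // "i"     : match the tag name case-insensitively
    return preg_replace('#<script\b[^>]*>.*?</script\s*>#is', '', $html);
}

// Made-up sample input with an upper-case, multi-line script block.
$html = "<p>Hi</p><SCRIPT type=\"text/javascript\">\nalert(1);\n</SCRIPT><p>Bye</p>";

echo removeScriptBlocks($html), "\n";
```

Unlike strip_tags(), this drops the script body along with the tags. It does nothing about malformed or unclosed tags, inline event handlers, or javascript: URLs, which is why HTML Purifier remains the safer choice for user-supplied HTML.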

answered Sep 29 '22 09:09

Todd Moses
