Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stripping HTML Comments With PHP But Leaving Conditionals

I'm currently using PHP and a regular expression to strip out all HTML comments from a page. The script works well... a little too well. It strips out all comments including my conditional comments in the . Here's what I've got:

<?php
  function callback($buffer)
  {
        return preg_replace('/<!--(.|\s)*?-->/', '', $buffer);
  }

  ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>

Since my regex isn't too hot I'm having trouble trying to figure out how to modify the pattern to exclude Conditional comments such as:

<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->

<!--[if IE 7]>
<link rel="stylesheet" href="/css/ie7.css" type="text/css" media="screen" />
<![endif]-->

<!--[if IE 6]>
<link rel="stylesheet" href="/css/ie6.css" type="text/css" media="screen" />
<![endif]-->

Cheers

like image 679
ianyoung Avatar asked Jun 18 '09 15:06

ianyoung


3 Answers

Since comments cannot be nested in HTML, a regex can do the job, in theory. Still, using some kind of parser would be the better choice, especially if your input is not guaranteed to be well-formed.

Here is my attempt at it. To match only normal comments, this would work. It has become quite a monster, sorry for that. I have tested it quite extensively, it seems to do it well, but I give no warranty.

<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->

Explanation:

<!--                #01: "<!--"
(?!                 #02: look-ahead: a position not followed by:
  \s*               #03:   any number of space
  (?:               #04:   non-capturing group, any of:
    \[if [^\]]+]    #05:     "[if ...]"
    |<!             #06:     or "<!"
    |>              #07:     or ">"
  )                 #08:   end non-capturing group
)                   #09: end look-ahead
(?:                 #10: non-capturing group:
  (?!-->)           #11:   a position not followed by "-->"
  .                 #12:   eat the following char, it's part of the comment
)*                  #13: end non-capturing group, repeat
-->                 #14: "-->"

Steps #02 and #11 are crucial. #02 makes sure that the following characters do not indicate a conditional comment. After that, #11 makes sure that the following characters do not indicate the end of the comment, while #12 and #13 cause the actual matching.

Apply with "global" and "dotall" flags.

To do the opposite (match only conditional comments), it would be something like this:

<!(--)?(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>

Explanation:

<!                  #01: "<!"
(--)?               #02: two dashes, optional
(?=\[)              #03: a position followed by "["
(?:                 #04: non-capturing group:
  (?!               #05:   a position not followed by
    <!\[endif\]\1>  #06:     "<![endif]>" or "<![endif]-->" (depends on #02)
  )                 #07:   end of look-ahead
  .                 #08:   eat the following char, it's part of the comment
)*                  #09: end of non-capturing group, repeat
<!\[endif\]\1>      #10: "<![endif]>" or "<![endif]-->" (depends on #02)

Again, apply with "global" and "dotall" flags.

Step #02 is because of the "downlevel-revealed" syntax, see: "MSDN - About Conditional Comments".

I'm not entirely sure where spaces are allowed or expected. Add \s* to the expression where appropriate.

like image 142
Tomalak Avatar answered Nov 06 '22 21:11

Tomalak


If you can't get it to work with one regular expression or you find you want to preserve more comments you could use preg_replace_callback. You can then define a function to handle the comments individually.

<?php
function callback($buffer) {
    return preg_replace_callback('/<!--.*-->/U', 'comment_replace_func', $buffer);
}

function comment_replace_func($m) {
    if (preg_match( '/^\<\!--\[if \!/i', $m[0])) {
        return $m[0];   
    }              

    return '';
}   

ob_start("callback");
?>

... HTML source goes here ...

<?php ob_end_flush(); ?>
like image 45
Tom Haigh Avatar answered Nov 06 '22 22:11

Tom Haigh


In summary this seems to be the best solution:

<?php
  function callback($buffer) {
    return preg_replace('/<!--[^\[](.|\s)*?-->/', '', $buffer);
  }
  ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>

It strips out all comments and leaves conditionals with the exception of the top one:

<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->

where the additional seems to be causing the problem.

If anyone can suggest the regex which would take this into account and leave that condtional in place too then that would be perfect.

Tomalak's solution looks good but as a newbie and no further guidelines I don't know how to implement it although I would like to try it if anyone can elaborate on how to apply it?

Thanks

like image 1
ianyoung Avatar answered Nov 06 '22 20:11

ianyoung