
preg_split mixed HTML and PHP tags except in quotes and comments

I have a PHP page mixed with HTML. Some example code:

<?php echo "<p>some text</p>"; ?>/* <? some php in comments ?> */
<p>some HTML text</p> <!-- <h1>some HTML in comments</h1> -->
<? $header_info = <<<END 
\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; ?>
<h2>Some more HTML</h2>

I would like to split at each PHP and HTML tag but leave any PHP tags or HTML tags in quotes or comments untouched/ignored. This is what I have so far:

$array = preg_split("/((^<\?php)|([^'|\"]<\?php)|([^'|\"]<\?)|([^'|\"]\?>)|(<\%)|(\%>))/i", $string, -1);

The issue I have is that some of the HTML closing brackets '>' are missing in the final $array. I would like to keep the HTML open and closing tags intact. Sometimes I end up with

<p></p instead of <p></p> 

It should look like this:

[0] echo "<p>some text</p>";  
[1] <p>some HTML text</p> 
[2] $header_info = <<<END 
\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; 
[3] <h2>Some more HTML</h2>

Any comments do not need to be part of the array as long as preg_split does not see them as any delimiters and disregards any of them.

I also just realized that some of the php tags, especially when using eval() can end up like this:

"?> <p>some HTML text</p> <?";

which would mean that the quotations in my regex would not match any of those cases.

preg_match() might be a better option, but I am not sure.

Any help would be very much appreciated as I am not very ingenious when it comes to regex and am rather stuck at this point.

Thanks a lot :)

asked Nov 12 '22 at 16:11 by user1862374


1 Answer

PREAMBLE
Since a regular expression solution was asked for, the following solution relies on regular expressions. In this particular case, however, a PHP parser would be better suited.
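
For comparison, here is a minimal sketch of the parser-based route, assuming PHP's built-in tokenizer extension (token_get_all) is available. The helper name split_php_html and the sample input are hypothetical. Note that the tokenizer only recognizes the short <? tag when short_open_tag is enabled, and that HTML comments (and /* */ outside PHP blocks) come back as ordinary inline HTML, so they would still need separate handling.

```php
<?php
// Hypothetical helper: split a source string into PHP-code and inline-HTML
// chunks using the tokenizer, so tags inside strings and PHP comments
// are never mistaken for real delimiters.
function split_php_html(string $source): array
{
    $chunks = [];
    $php = null; // code of the PHP block being read, or null outside PHP

    foreach (token_get_all($source) as $token) {
        // Tokens are either [id, text, line] arrays or one-character strings.
        [$id, $text] = is_array($token) ? $token : [null, $token];

        if ($id === T_OPEN_TAG || $id === T_OPEN_TAG_WITH_ECHO) {
            $php = '';
        } elseif ($id === T_CLOSE_TAG) {
            $chunks[] = ['php', trim($php)];
            $php = null;
        } elseif ($id === T_INLINE_HTML) {
            $chunks[] = ['html', $text];
        } elseif ($id !== T_COMMENT && $id !== T_DOC_COMMENT) {
            $php .= $text; // ordinary code; PHP comments are skipped
        }
    }
    if ($php !== null && trim($php) !== '') {
        $chunks[] = ['php', trim($php)]; // input ended without a closing tag
    }
    return $chunks;
}

$source = '<?php echo "<p>quoted ?></p>"; /* <?php ignored ?> */ ?><p>html</p>';
print_r(split_php_html($source));
```

Each chunk is a [kind, text] pair, so the caller can keep or discard the HTML parts as needed. Heredoc and nowdoc bodies stay inside their PHP chunk because the tokenizer understands them.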

Regular Expression

#(?<!"|\')<\\?(?:php)?\\s+(.+?)\\?>(?!"|\')|/\*.+\*/|<!--.+-->#is

Scriptlet

$subject = '<?php echo "<p>some text</p>"; ?>/* <? some php in comments ?> */
<p>some HTML text</p> <!-- <h1>some HTML in comments</h1> -->
<? $header_info = <<<END
\\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; ?>
<h2>Some more HTML</h2>';

$returnValue = preg_replace('#(?<!"|\')<\\?(?:php)?\\s+(.+?)\\?>(?!"|\')|/\*.+\*/|<!--.+-->#is', '$1', $subject, -1);

var_dump(preg_split('#\\r?\\n#s', $returnValue));

Result

array(6) {
  [0]=>
  string(25) "echo "<p>some text</p>"; "
  [1]=>
  string(22) "<p>some HTML text</p> "
  [2]=>
  string(21) "$header_info = <<<END"
  [3]=>
  string(60) "\$some="<?php @ob_start(); @session_set_save_handler(); ?>";"
  [4]=>
  string(5) "END; "
  [5]=>
  string(23) "<h2>Some more HTML</h2>"
}

DEMO
http://sandbox.onlinephpfunctions.com/code/017a51877b50f272f151feade7b59e142757481e

Discussion

1. # 
2. (?<!"|\')
3. <\\?(?:php)?\\s+
4. (.+?)
5. \\?>
6. (?!"|\')
7. |/\*.+\*/
8. |<!--.+-->
9. #is

line 1 I use # as the regex delimiter because it avoids having to escape /.
line 2 This is the key to the regex. A negative lookbehind ensures that the opening PHP tag that follows is NOT preceded by a single or double quote. Unlike a character class, the lookbehind does not consume the preceding character.
line 3 This defines what an opening PHP tag is. To support ASP tags too, this line can be changed to <(?:\\?(?:php)?|%)\\s+ and line 5 to (?:\\?|%)>
line 4 Having detected the start of a PHP code sequence, we lazily match every character in it. Note that on line 9 we use the s flag so that the dot also matches newlines inside the PHP code sequence.
line 5 We mark the end of the PHP code sequence.
line 6 A negative lookahead ensures that the closing PHP tag just matched is not followed by a single or double quote.
line 7,8 PHP and HTML comments are matched as well. Since group 1 is empty for them, the replacement with $1 removes them from the output.
line 9 End of the regex, with the i (case-insensitive) and s (dot matches newline) flags.
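
To illustrate why the lookbehind on line 2 matters, the following sketch compares it with a consuming character class like the one in the question. The sample string and the simplified patterns are assumptions made for the demonstration, not the full regex above.

```php
<?php
$html = '<p>text</p><?php echo 1; ?><h2>more</h2>';

// Consuming class, as in the question: the character before the opening
// tag becomes part of the delimiter and is dropped from the pieces.
$consuming = preg_split('#[^\'"]<\?php|\?>#', $html);

// Lookbehind: the preceding character is only inspected, never consumed.
$lookbehind = preg_split('#(?<![\'"])<\?php|\?>#', $html);

print_r($consuming);  // first piece is "<p>text</p" (closing ">" lost)
print_r($lookbehind); // first piece is "<p>text</p>"
```

This is the same symptom as the truncated <p></p reported in the question.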

Known issues

  • After executing the regex on $subject, the result is simply split on newlines (each optionally preceded by a carriage return), so a multi-line PHP block such as the heredoc ends up as several array elements instead of one.
  • No effort is made to handle PHP heredoc or nowdoc syntax.
  • This regex should NOT be viewed as bulletproof against arbitrary PHP code in the wild. PHP parsers are far better suited.
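
One more caveat that is easy to check: the comment alternatives use a greedy .+ together with the s flag, so when the input contains several comments a single match can run from the first opener to the last closer. The sketch below uses a made-up string to show the difference; the lazy .*? variant is a suggested adjustment, not part of the regex above.

```php
<?php
$text = "a /* one */ b /* two */ c";

// Greedy: one match spans from the first "/*" to the last "*/".
$greedy = preg_replace('#/\*.+\*/#s', '', $text);

// Lazy: each comment is matched and removed on its own.
$lazy = preg_replace('#/\*.*?\*/#s', '', $text);

var_dump($greedy); // string(4) "a  c"
var_dump($lazy);   // string(7) "a  b  c"
```

With the greedy form, the text between the two comments is lost along with them.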
answered Nov 15 '22 at 04:11 by Stephan