Basically, what I'm trying to do is search through a rather large PHP file and replace any block of PHP code that includes the string "search_term" somewhere in it with some other code. For example,
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
should become
<?php
//some stuff
?>
HELLO
<?php
// last stuff
?>
What I've got so far is
$string =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/ims;
This correctly matches the closest closing ?>, but begins the match at the very first <?php, instead of the one closest to the string search_term.
What am I doing wrong?
Generally, I don't like to use non-greedy matching, because it usually leads to problems like this. Perl looks at your file, finds the first '<?php', then starts looking for the rest of the regexp. It passes over the first '?>' and the second '<?php' because they match .*, then finds search_term and the next '?>', and it's done.
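To see that behaviour in isolation, here is a small sketch that runs the original substitution against the sample from the question. The variable names ($sample, $greedy_result) are just placeholders chosen for this illustration.

```perl
use strict;
use warnings;

# Sample input copied from the question (single-quoted heredoc, so $str is literal).
my $sample = <<'END';
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
END

# The original attempt: the match starts at the FIRST "<?php", and the
# non-greedy .*? simply grows across the first "?>" until it reaches search_term.
(my $greedy_result = $sample) =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/ims;

# Both the first and the second block are gone; only the last block survives.
print $greedy_result;
```

The first block is swallowed along with the second one, which is exactly the problem described in the question.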
Non-greedy matching means that you have a regular expression that matches more things than you really want, and it leaves it up to Perl to decide which match to return. It's better to use a regular expression that matches exactly what you want to match. In this case, you can get what you want by using ((?!\?>).)* instead of .*? (here (?!\?>) is a negative look-ahead assertion):
s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/is;
If you expect multiple matches, you might want to use /isg rather than /is.
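As a rough check of the /isg variant, the sketch below applies it to a made-up input with two matching blocks. The test string and the variable names are assumptions for illustration, not part of the original file.

```perl
use strict;
use warnings;

# Hypothetical input: four one-line blocks, two of which contain the search term
# (the second one in upper case, to exercise the /i flag).
my $multi = join "\n",
    '<?php /* first */ ?>',
    '<?php $a = "search_term"; ?>',
    '<?php /* middle */ ?>',
    '<?php $b = "SEARCH_TERM"; ?>',
    '';

# ((?!\?>).)* can never step over a "?>", so each match stays inside one block.
(my $multi_result = $multi) =~ s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/isg;

# Blocks 2 and 4 become HELLO; blocks 1 and 3 are left untouched.
print $multi_result;
```

Without the /g flag only the first matching block would be replaced.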
Alternatively, just split the file into blocks:
my @blocks = split /(\?>)/, $string;
while (@blocks) {
    my $block = shift @blocks;
    my $sep   = shift @blocks;    # undef after the final chunk
    if ($block =~ /search_term/) {
        print "HELLO";
    } else {
        print $block, $sep // '';
    }
}
You just need to put your first capture group back into your replacement. Something like this:
s/<\?php(.*)<\?php(.*?)search_term(.*?)\?>/<\?php$1HELLO/ims
$string =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;
(?:(?!\?>|search_term).)* matches one character at a time, after making sure the character isn't the beginning of ?> or search_term. When that stops matching, if the next thing in the string is search_term it consumes that and everything after it until the next ?>. Otherwise, that attempt fails and it starts over at the next <?php.
The crucial point is that, like @RobertYoung's solution, it's not allowed to match ?> as it searches for search_term. By not matching search_term either, it eliminates backtracking, which makes the search more efficient. Depending on the size of the source string that may not matter, but it won't noticeably hurt performance either.
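For what it's worth, here is a minimal sketch that runs this substitution on the sample from the question. The variable names ($input, $output) are placeholders for this illustration.

```perl
use strict;
use warnings;

# Sample input copied from the question.
my $input = <<'END';
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
END

# The look-ahead stops the scan at "?>" or at "search_term", whichever comes
# first, so an attempt starting in the first block fails instead of running on.
(my $output = $input) =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;

# Only the middle block is replaced, matching the desired output in the question.
print $output;
```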
@Benj's solution (as currently posted) does not work. It yields the desired output with the sample string you provided, but that's only by accident. It only replaces the last code block with search_term in it, and (as @mob commented) it completely ignores the contents of the very first code block.
s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/ims;
In your regular expression, the regex engine is trying to find the earliest occurrence of a substring that matches your target expression, and it finds it between the first <?php and the second ?>.
By putting (.*) at the start of the regex, you trick the regex engine into going to the end of the string (since .* matches the whole string), and then backtracking to spots where it can find the string "<?php". That way the resulting match won't include any more <?php tokens than necessary.
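One consequence of that backtracking is worth checking before relying on it: since the leading (.*) is greedy, a single substitution lands on the last block that contains search_term. The sketch below uses a made-up input with two matching blocks; the test string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: blocks 2 and 4 both contain the search term.
my $text = join "\n",
    '<?php /* first */ ?>',
    '<?php $a = "search_term"; ?>',
    '<?php /* middle */ ?>',
    '<?php $b = "search_term"; ?>',
    '';

# The greedy (.*) backs off from the end of the string, so the first "<?php"
# it settles on is the one that opens the LAST matching block.
(my $replaced = $text) =~ s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/ims;

# Block 2 is left as it was; only block 4 becomes HELLO.
print $replaced;
```

So with this input only one of the two blocks is rewritten. If every matching block has to be replaced, a pattern that can be used with /g is probably the safer choice.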