matching the closest strings to a search term (perl regex)

Question

Basically, what I'm trying to do is search through a rather large PHP file and replace any block of PHP code that includes the string "search_term" somewhere in it with some other code. For example,

<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php 
// last stuff
?>

should become

<?php
//some stuff
?>
HELLO
<?php 
// last stuff
?>

What I've got so far is

$string =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/ims;

This correctly matches the closest closing ?>, but begins the match at the very first <?php, instead of the one closest to the string search_term.

What am I doing wrong?

Robert Young · Accepted Answer

Generally, I don't like to use non-greedy matching, because it usually leads to problems like this. Perl looks at your file, finds the first '<?php', then starts looking for the rest of the regexp. It passes over the first '?>' and the second '<?php' because .*? is allowed to match them (non-greedy only means "stop as early as possible from this starting point", not "start as late as possible"), then finds search_term and the next '?>', and it's done.
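To see that behaviour concretely, here is a small sketch that runs the question's pattern against the question's sample and records where the match begins. The variable names ($string, $start, $matched) are only for illustration.

```perl
use strict;
use warnings;

# Sample input from the question (single-quoted heredoc, so $str stays literal).
my $string = <<'END';
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
END

# Match only (no substitution) to inspect what the pattern grabs.
my ($start, $matched) = (-1, '');
if ($string =~ /<\?php(.*?)search_term(.*?)\?>/is) {
    $start   = $-[0];                                    # offset where the match begins
    $matched = substr($string, $start, $+[0] - $start);  # the full matched text
}
print "match starts at offset $start\n";
print $matched, "\n";
```

The reported offset is 0: the match begins at the very first <?php and swallows the first block along with the second one.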

Non-greedy matching means that you have a regular expression that matches more things than you really want, and it leaves it up to Perl to decide which match to return. It's better to use a regular expression that matches exactly what you want to match. In this case, you can get what you want by using ((?!\?>).)* instead of .*? (here (?!\?>) is a negative look-ahead assertion, so each character is consumed only if it does not begin a ?>).

s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/is;

If you expect multiple matches, you might want to use /isg rather than /is.
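As a quick check of the look-ahead version with /isg, here is a minimal sketch. The helper name replace_blocks and the three-block input are made up for the example, and I've used non-capturing groups (?:...) since the captures aren't needed.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every PHP block that contains "search_term".
sub replace_blocks {
    my ($text) = @_;
    # (?:(?!\?>).)* consumes one character at a time but never steps onto
    # "?>", so a match cannot spill out of the block it started in.
    $text =~ s/<\?php(?:(?!\?>).)*search_term(?:(?!\?>).)*\?>/HELLO/isg;
    return $text;
}

my $input = "<?php a(); ?>\n<?php \$s = 'search_term'; ?>\n<?php b(); ?>\n";
print replace_blocks($input);
```

Only the middle block is replaced; the first and last blocks are left alone because the pattern cannot cross their closing ?> while looking for search_term.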

Alternatively, just split the file into blocks:

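# The capturing group makes split return each "?>" separator as its own element.
my @blocks = split /(\?>)/, $string;
while (@blocks) {
    my $block = shift @blocks;
    my $sep   = shift(@blocks) // '';   # the last chunk has no closing ?> (// needs Perl 5.10+)
    # Replace only the PHP part, keeping any text before the opening tag.
    if ($block =~ s/<\?php.*search_term.*/HELLO/is) {
        print $block;
    } else {
        print $block, $sep;
    }
}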

Benj · Answer

You just need to put your first capture group back into your replacement. Something like this:

s/<\?php(.*)<\?php(.*?)search_term(.*?)\?>/<\?php$1HELLO/ims

Alan Moore · Answer

$string =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;

(?:(?!\?>|search_term).)* matches one character at a time, after making sure the character isn't the beginning of ?> or search_term. When that stops matching, if the next thing in the string is search_term it consumes that and everything after it until the next ?>. Otherwise, that attempt fails and it starts over at the next <?php.

The crucial point is that, like @RobertYoung's solution, it's not allowed to match ?> as it searches for search_term. By not matching search_term either, it eliminates backtracking, which makes the search more efficient. Depending on the size of the source string that may not matter, but it won't noticeably hurt performance either.
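Here is the same pattern written with /x so each piece can carry a comment, run on a made-up input with two matching blocks (one of which contains the term twice). The sub name replace_blocks_tempered and the input are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution above, spelled out with /x.
sub replace_blocks_tempered {
    my ($text) = @_;
    $text =~ s{
        <\?php                            # opening tag
        (?: (?! \?> | search_term ) . )*  # any char that does not begin "?>" or the term
        search_term                       # the term itself
        .*?                               # lazily on to ...
        \?>                               # ... the nearest closing tag
    }{HELLO}isgx;
    return $text;
}

my $input = join "\n",
    q{<?php one(); ?>},
    q{<?php $a = "search_term"; ?>},
    q{<?php two(); ?>},
    q{<?php $b = "search_term"; $c = "search_term"; ?>},
    '';

print replace_blocks_tempered($input);
```

Both blocks containing the term become HELLO, and the block with two occurrences is still replaced once as a whole, because the trailing .*? runs to the nearest ?> after the first occurrence.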

@Benj's solution (as currently posted) does not work. It yields the desired output with the sample string you provided, but that's only by accident. It only replaces the last code block with search_term in it, and (as @mob commented) it completely ignores the contents of the very first code block.

mob · Answer

s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/ims;

In your regular expression, the regex engine is trying to find the earliest occurrence of a substring that matches your target expression, and it finds it between the first <?php and the second ?>.

By putting (.*) at the start of the regex, you trick the regex engine into going to the end of the string (since .* matches the whole string), and then backtracking to spots where it can find the string "<?php". That way the resulting match won't include any more <?php tokens than necessary.
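A small sketch of what that means in practice, on a made-up input with two matching blocks. A single pass only replaces the last qualifying block, so to replace all of them the substitution has to be repeated. The loop is my addition, not part of the answer above, and it assumes the replacement text never contains search_term (otherwise it would not terminate).

```perl
use strict;
use warnings;

my $input = join "\n",
    q{<?php $a = 'search_term'; ?>},
    q{<?php x(); ?>},
    q{<?php $b = 'search_term'; ?>},
    '';

# One pass: the greedy (.*) pushes the match to the LAST qualifying block.
(my $once = $input) =~ s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/is;

# Repeat until nothing matches any more (works from the end towards the start).
my $all = $input;
1 while $all =~ s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/is;

print $once, "---\n", $all;
```

After one pass only the third block has become HELLO; after the loop the first one has too, and the middle block is untouched. Note that .*? can still cross a ?> if search_term appears outside any PHP block, so this variant relies on the term only occurring inside PHP code.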

Tags:

regex

replace

perl

Mala
