
Matching the closest strings to a search term (Perl regex)

Basically, what I'm trying to do is search through a rather large PHP file and replace any block of PHP code that includes the string "search_term" somewhere in it with some other code. For example,

<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php 
// last stuff
?>

should become

<?php
//some stuff
?>
HELLO
<?php 
// last stuff
?>

What I've got so far is

$string =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/ims;

This correctly matches the closest closing ?>, but begins the match at the very first <?php, instead of the one closest to the string search_term.

What am I doing wrong?

Mala asked May 11 '12 21:05


4 Answers

Generally, I don't like to use non-greedy matching, because it usually leads to problems like this. Perl looks at your file, finds the first '<?php', then starts looking for the rest of the regexp. It passes over the first '?>' and the second '<?php' because they match .*, then finds search_term and the next '?>', and it's done.

Non-greedy matching means that you have a regular expression that matches more things than you really want, and it leaves it up to Perl to decide which match to return. It's better to use a regular expression that matches exactly what you want to match. In this case, you can get what you want by using ((?!\?>).)* instead of .*?, where (?!\?>) is a negative look-ahead assertion that stops the match from running past a ?>:

s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/is;

If you expect multiple matches, you might want to use /isg rather than /is.
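
To see the difference, here is a minimal sketch that runs both substitutions on the sample from the question (trailing whitespace dropped). The variable names $php, $lazy and $tempered are made up for the illustration.

```perl
use strict;
use warnings;

# Sample input from the question. The single-quoted heredoc keeps $str literal.
my $php = <<'END';
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
END

# Original attempt: .*? is free to run across the first ?> and the second <?php.
(my $lazy = $php) =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/is;

# Tempered version: each character is consumed only if it does not start a ?>.
(my $tempered = $php) =~ s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/is;

print "lazy:\n$lazy\n";
print "tempered:\n$tempered";
```

With this input, $lazy starts with HELLO because the match began at the very first <?php, while $tempered keeps the first block and replaces only the second one.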

Alternatively, just split the file into blocks:

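# The capturing group makes split return each ?> separator as its own element.
my @blocks = split /(\?>)/, $string;
while (@blocks) {
    my $block = shift @blocks;
    # The text after the last ?> has no separator of its own.
    my $sep = @blocks ? shift @blocks : '';
    if ($block =~ /search_term/) {
        print "HELLO";
    } else {
        print $block, $sep;
    }
}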
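
One thing to be aware of: the loop above replaces the whole chunk, including any text between the previous ?> and the opening <?php. A possible variant, sketched here under the assumption that each chunk contains at most one <?php, keeps that leading text and collects the result in a string. The names $php, @chunks, $close and $out are illustrative only.

```perl
use strict;
use warnings;

# Sample input from the question. The single-quoted heredoc keeps $str literal.
my $php = <<'END';
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
END

my @chunks = split /(\?>)/, $php;
my $out = '';
while (@chunks) {
    my $chunk = shift @chunks;
    # The text after the last ?> has no closing tag of its own.
    my $close = @chunks ? shift @chunks : '';
    if ($close ne '' && $chunk =~ /<\?php.*search_term/is) {
        # Replace only from the opening tag onwards, keeping any text before it.
        $chunk =~ s/<\?php.*\z/HELLO/is;
        $out .= $chunk;
    } else {
        $out .= $chunk . $close;
    }
}
print $out;
```

On the sample this keeps the newline in front of the second block, so HELLO ends up on its own line as in the desired output.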
Robert Young answered Nov 02 '22 09:11


You just need to put your first capture group back into your replacement. Something like this:

s/<\?php(.*)<\?php(.*?)search_term(.*?)\?>/<\?php$1HELLO/ims
Benj answered Nov 02 '22 09:11


$string =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;

(?:(?!\?>|search_term).)* matches one character at a time, after making sure the character isn't the beginning of ?> or search_term. When that stops matching, if the next thing in the string is search_term it consumes that and everything after it until the next ?>. Otherwise, that attempt fails and it starts over at the next <?php.

The crucial point is that, like @RobertYoung's solution, it's not allowed to match ?> as it searches for search_term. By not matching search_term either, it eliminates backtracking, which makes the search more efficient. Depending on the size of the source string that may not matter, but it won't noticeably hurt performance either.

@Benj's solution (as currently posted) does not work. It yields the desired output with the sample string you provided, but that's only by accident. It only replaces the last code block with search_term in it, and (as @mob commented) it completely ignores the contents of the very first code block.
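
A small test script illustrates both points. The input is made up (it is not the sample from the question) so that two blocks contain search_term; $php, $all and $last_only are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical input: blocks one and three contain search_term, block two does not.
my $php = <<'END';
<?php // one: search_term ?>
<p>html</p>
<?php // two ?>
<?php // three: search_term ?>
END

# Tempered pattern from this answer, applied globally.
(my $all = $php) =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;

# Pattern from @Benj's answer as posted: the greedy (.*) swallows block one.
(my $last_only = $php) =~ s/<\?php(.*)<\?php(.*?)search_term(.*?)\?>/<\?php$1HELLO/is;

print "tempered with /g:\n$all\n";
print "greedy prefix:\n$last_only";
```

With this input, $all has HELLO in place of blocks one and three, while $last_only still contains block one unchanged and replaces only block three.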

Alan Moore answered Nov 02 '22 09:11


s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/ims;

In your regular expression, the regex engine is trying to find the earliest occurrence of a substring that matches your target expression, and it finds it between the first <?php and the second ?>.

By putting (.*) at the start of the regex, you trick the regex engine into going to the end of the string (since .* matches the whole string), and then backtracking to spots where it can find the string "<?php". That way the resulting match won't include any more <?php tokens than necessary.
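
Here is a sketch of what that means when more than one block qualifies, using a made-up input. A single pass replaces only the last such block. The 1 while loop is an assumption added for illustration, not part of the substitution above, and it only terminates if the replacement text never contains search_term. The names $php, $once and $every are illustrative.

```perl
use strict;
use warnings;

# Hypothetical input: blocks one and three contain search_term, block two does not.
my $php = <<'END';
<?php // one: search_term ?>
<p>html</p>
<?php // two ?>
<?php // three: search_term ?>
END

# One pass: the greedy (.*) backtracks from the end of the string, so the
# match starts at the last <?php that is followed by search_term.
(my $once = $php) =~ s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/is;

# Repeating the substitution until it fails works backwards through the text.
my $every = $php;
1 while $every =~ s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/is;

print "one pass:\n$once\n";
print "repeated:\n$every";
```

With this input, $once replaces only block three, and $every replaces blocks three and one in that order while leaving block two alone.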

mob answered Nov 02 '22 10:11