
Match from last occurrence using regex in perl

Tags: regex, perl

I have a text like this:

hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of 
/* any string */ select this part on
ly 
////RESULT rest string

The text spans multiple lines, and I need to extract everything from the last occurrence of "*/" up to "////RESULT". In this case, the result should be:

 select this part on
ly 

How can I achieve this in Perl?

I have tried \\\*/(.|\n)*////RESULT but that starts matching from the first "*/".

Peiti Li asked Jan 02 '13 18:01



1 Answer

A useful trick in cases like this is to prefix the regexp with the greedy pattern .*, which will try to match as many characters as possible before the rest of the pattern matches. So:

my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);

Let's break this pattern into its components:

  • ^.* starts at the beginning of the string and matches as many characters as it can. (The s modifier allows . to match even newlines.) The beginning-of-string anchor ^ is not strictly necessary, but it ensures that the regexp engine won't waste too much time backtracking if the match fails.

  • \*/ just matches the literal string */.

  • (.*?) matches and captures any number of characters; the ? makes it ungreedy, so it prefers to match as few characters as possible in case there's more than one position where the rest of the regexp can match.

  • Finally, ////RESULT just matches itself.
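To see what the leading .* buys you, here is a minimal sketch that runs the pattern with and without it on the sample text from the question. The variable names ($from_first, $from_last) are just mine for illustration, and I rebuild the text with join so that the trailing spaces are explicit.

```perl
use strict;
use warnings;

# Sample text from the question; trailing spaces are kept on purpose.
my $string = join "\n",
    'hello world /* select a from table_b',
    '*/ some other text with new line cha',
    'racter and there are some blocks of ',
    '/* any string */ select this part on',
    'ly ',
    '////RESULT rest string';

# Without the greedy prefix, the engine settles for the first "*/" it finds.
my ($from_first) = $string =~ m!\*/(.*?)////RESULT!s;

# With ^.* in front, the greedy prefix swallows everything up to the last "*/".
my ($from_last) = $string =~ m!^.*\*/(.*?)////RESULT!s;

print "from first: [$from_first]\n";
print "from last:  [$from_last]\n";
```

On this input, $from_first begins right after the first "*/" (with " some other text"), while $from_last holds only " select this part on", a newline, "ly " and a final newline.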

Since the pattern contains a lot of slashes, and since I wanted to avoid leaning toothpick syndrome, I decided to use alternative regexp delimiters. Exclamation points (!) are a popular choice, since they don't collide with any normal regexp syntax.
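For comparison, here is a small sketch of the same pattern written with three different delimiters. The short test string and the variable names are made up for the example; the point is only that the slash-delimited version needs every / escaped, while the other two do not.

```perl
use strict;
use warnings;

# The same pattern with three delimiter choices.
my $with_bang  = qr!^.*\*/(.*?)////RESULT!s;         # no escaping of / needed
my $with_slash = qr/^.*\*\/(.*?)\/\/\/\/RESULT/s;    # leaning toothpicks
my $with_brace = qr{^.*\*/(.*?)////RESULT}s;         # braces work as well

# Hypothetical one-line input, just for illustration.
my $text = 'a */ b */ c ////RESULT d';

my @captures;
for my $re ($with_bang, $with_slash, $with_brace) {
    my ($capture) = $text =~ $re;
    push @captures, $capture;
}

print "[$_]\n" for @captures;
```

All three should capture the same text, " c ", since the delimiters change only how the pattern is written, not what it matches.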


Edit: Per discussion with ikegami below, I guess I should note that, if you want to use this regexp as a sub-pattern in a longer regexp, and if you want to guarantee that the string matched by (.*?) will never contain ////RESULT, then you should wrap those parts of the regexp in an independent (?>) subexpression, like this:

my $regexp = qr!\*/(?>(.*?)////RESULT)!s;
...
my ($match) = ($string =~ /^.*$regexp$some_other_regexp/s);

The (?>) causes the pattern inside it to fail rather than accepting a suboptimal match (i.e. one that extends beyond the first substring matching ////RESULT) even if that means that the rest of the regexp will fail to match.
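Here is a minimal sketch of the difference. The input string and the trailing sub-pattern $tail are hypothetical, chosen so that the rest of the regexp can only match after the second ////RESULT.

```perl
use strict;
use warnings;

# Hypothetical input with two ////RESULT markers after the last "*/".
my $string = 'a */ x ////RESULT y ////RESULT END';

# Hypothetical trailing sub-pattern that only matches after the second marker.
my $tail = qr/ END/;

my $plain  = qr!\*/(.*?)////RESULT!s;        # (.*?) may be stretched by backtracking
my $atomic = qr!\*/(?>(.*?)////RESULT)!s;    # locked to the first ////RESULT

my ($loose)  = $string =~ /^.*$plain$tail/s;
my ($locked) = $string =~ /^.*$atomic$tail/s;

print "plain:  ", (defined $loose  ? "[$loose]"  : "no match"), "\n";
print "atomic: ", (defined $locked ? "[$locked]" : "no match"), "\n";
```

With the plain version, the capture is stretched to " x ////RESULT y " so that $tail can match. With the atomic version, the whole match fails instead, which is the guarantee described above.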

Ilmari Karonen answered Oct 31 '22 02:10
