
Getting multiple matches within a string using regex in Perl

After having read this similar question and having tried my code several times, I keep on getting the same undesired output.

Let's assume the string I'm searching is "I saw wilma yesterday". The regex should capture each run of word characters followed by an 'a', plus up to five characters (spaces included) after that 'a'.

The code I wrote is the following:

$_ = "I saw wilma yesterday";

if (@m = /(\w+)a(.{5,})?/g){
    print "found " . @m . " matches\n";

    foreach(@m){
        print "\t\"$_\"\n";
    }
}

However, I kept on getting the following output:

found 2 matches
    "s"
    "w wilma yesterday"

while I expected to get the following one:

found 3 matches:
    "saw wil"
    "wilma yest"
    "yesterday"

Then I realized that the values returned in @m were simply $1 and $2 of a single match, as the output shows.

Now, since the /g flag is on, and I don't think the problem is about the regex, how could I get the desired output?

Acsor asked Jul 10 '13 20:07


2 Answers

You can try this pattern, which allows overlapping matches:

(?=\b(\w+a.{1,5}))

or

(?=(?i)\b([a-z]+a.{0,5}))

example:

use strict;
my $str = "I saw wilma yesterday";
my @matches = ($str =~ /(?=\b([a-z]+a.{0,5}))/gi);
print join("\n", @matches),"\n";

more explanations:

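A plain regex can't return overlapping matches: once a character has been "eaten" (consumed) by the regex engine, it can't be eaten a second time by a later match. The trick to get around this constraint is to wrap the whole pattern in a lookahead, a zero-width assertion that checks what follows without consuming it, and to put a capturing group inside. Because the lookahead consumes nothing, the engine retries it at every position in the string, and the capturing group still records the text.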
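To make the difference concrete, here is a minimal sketch that runs the same pattern twice on the question's string, once as an ordinary consuming match and once wrapped in a lookahead. The variable names @consuming and @lookahead are only illustrative.

```perl
use strict;
use warnings;

my $str = "I saw wilma yesterday";

# Consuming match: the next search starts where the previous match ended,
# so "wilma" is skipped because "wil" was already eaten by the first match.
my @consuming = $str =~ /\b([a-z]+a.{0,5})/gi;

# Lookahead match: nothing is consumed, so every word start gets its turn.
my @lookahead = $str =~ /(?=\b([a-z]+a.{0,5}))/gi;

print "consuming: ", join(" | ", @consuming), "\n";
print "lookahead: ", join(" | ", @lookahead), "\n";
```

The consuming version should print "saw wil | yesterday", while the lookahead version should print "saw wil | wilma yest | yesterday".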

For another example of this behaviour, you can try the example code without the word boundary (\b) to see the result.
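As a sketch of that experiment, the snippet below drops the \b from the example pattern. The variable name @no_boundary is only illustrative.

```perl
use strict;
use warnings;

my $str = "I saw wilma yesterday";

# Same lookahead pattern as in the example, but without the word boundary:
# the engine may now start a match in the middle of a word.
my @no_boundary = $str =~ /(?=([a-z]+a.{0,5}))/gi;

print "found " . @no_boundary . " matches\n";
print "\t\"$_\"\n" for @no_boundary;
```

Without \b, the list also contains matches that start mid-word, such as "ilma yest" or "sterday", which is why the boundary is needed to get one result per word.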

Casimir et Hippolyte answered Sep 26 '22 02:09


Firstly you want to capture everything inside the expression, i.e.:

/(\w+a.{0,5})/

Next you want to restart the search one character past the position where the previous match started, rather than where it ended.

The pos() function lets you read or set the offset at which the next /g match on a string starts searching.
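One way this approach could look is sketched below. It makes two assumptions that are not spelled out in the answer: the pattern uses .{0,5} (at most five trailing characters) and a leading \b, so that the result matches the output expected in the question. The array name @matches is illustrative.

```perl
use strict;
use warnings;

my $str = "I saw wilma yesterday";
my @matches;

# In scalar context, each /g match continues from pos($str).
while ($str =~ /\b(\w+a.{0,5})/g) {
    push @matches, $1;
    # $-[0] is the offset where the last match started; restart the
    # search one character after it instead of after the match end.
    pos($str) = $-[0] + 1;
}

print "found " . @matches . " matches\n";
print "\t\"$_\"\n" for @matches;
```

With the assumed \b in place this should find "saw wil", "wilma yest" and "yesterday". Leaving the \b out would also return matches that start in the middle of a word.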

PP. answered Sep 26 '22 02:09