Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match the nth longest possible string in Perl

The pattern matching quantifiers of a Perl regular expression are "greedy" (they match the longest possible string). To force the match to be "ungreedy", a ? can be appended to the pattern quantifier (*, +).

Here is an example:

#!/usr/bin/perl

$string="111s11111s";

#-- greedy match
$string =~ /^(.*)s/;
print "$1\n"; # prints 111s11111

#-- ungreedy match
$string =~ /^(.*?)s/;
print "$1\n"; # prints 111

But how one can find the second, third and .. possible string match in Perl? Make a simple example of yours --if need a better one.

like image 351
Ali Abbasinasab Avatar asked Jun 28 '14 01:06

Ali Abbasinasab


1 Answers

Utilize a conditional expression, a code expression, and backtracking control verbs.

my $skips = 1;
$string =~ /^(.*)s(?(?{$skips-- > 0})(*FAIL))/;

The above will use greedy matching, but will cause the largest match to intentionally fail. If you wanted the 3rd largest, you could just set the number of skips to 2.

Demonstrated below:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "111s11111s11111s";

$string =~ /^(.*)s/;
print "Greedy match     - $1\n";

$string =~ /^(.*?)s/;
print "Ungreedy match   - $1\n";

my $skips = 1;
$string =~ /^(.*)s(?(?{$skips-- > 0})(*FAIL))/;
print "2nd Greedy match - $1\n";

Outputs:

Greedy match     - 111s11111s11111
Ungreedy match   - 111
2nd Greedy match - 111s11111

When using such advanced features, it is important to have a full understanding of regular expressions to predict the results. This particular case works because the regex is fixed on one end with ^. That means that we know that each subsequent match is also one shorter than the previous. However, if both ends could shift, we could not necessarily predict order.

If that were the case, then you find them all, and then you sort them:

use strict;
use warnings;

my $string = "111s11111s";

my @seqs;
$string =~ /^(.*)s(?{push @seqs, $1})(*FAIL)/;

my @sorted = sort {length $b <=> length $a} @seqs;

use Data::Dump;
dd @sorted;

Outputs:

("111s11111s11111", "111s11111", 111)

Note for Perl versions prior to v5.18

Perl v5.18 introduced a change, /(?{})/ and /(??{})/ have been heavily reworked, that enabled the scope of lexical variables to work properly in code expressions as utilized above. Before then, the above code would result in the following errors, as demonstrated in this subroutine version run under v5.16.2:

Variable "$skips" will not stay shared at (re_eval 1) line 1.
Variable "@seqs" will not stay shared at (re_eval 2) line 1.

The fix for older implementations of RE code expressions is to declare the variables with our, and for further good coding practices, to localize them when initialized. This is demonstrated in this modified subroutine version run under v5.16.2, or as put below:

local our @seqs;
$string =~ /^(.*)s(?{push @seqs, $1})(*FAIL)/;
like image 86
Miller Avatar answered Nov 15 '22 03:11

Miller