
How can I detect how many capture groups are in a Perl Regexp?

Tags: regex, perl

I have a bunch of Perl regexps in a script and would like to know how many capture groups each contains. More precisely, I'd like to know how many items would be added to the @- and @+ arrays if they matched, before I actually use them in a real match operation.

An example:

'XXAB(CD)DE\FG\XX' =~ /(?i)x(ab)\(cd\)(?:de)\\(fg\\)x/
    and print "'@-', '@+'\n";

In this case the output is:

'1 2 11', '15 4 14'

So after matching I know that the 0th item is the matched part of the string, and there are two capture group expressions. Would it be possible to know right before the actual match?

I tried concentrating on the opening brackets. First I removed the '\\' patterns, to make the escaped brackets easier to detect. Then I removed the '\(' strings, and then the '(?' sequences. Now I can count the remaining opening brackets.

my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'; print "ORIG: '$re'\n";
'XXAB(CD)DE\FG\XX' =~ /$re/ and print "RE: '@-', '@+'\n";
$re =~ s/\\\\//g; print "\\\\: '$re'\n";
$re =~ s/\\\(//g; print "\\(: '$re'\n";
$re =~ s/\(\?//g; print "\\?: '$re'\n";
my $n = ($re =~ s/\(//g); print "n=$n\n";

Output:

ORIG: '(?i)x(ab)\(cd\)(?:de)\\(fg\\)x'
RE: '1 2 11', '15 4 14'
\\: '(?i)x(ab)\(cd\)(?:de)(fg)x'
\(: '(?i)x(ab)cd\)(?:de)(fg)x'
\?: 'i)x(ab)cd\):de)(fg)x'
n=2

So now I know that this regexp contains 2 capture groups. But maybe there is an easier way, and this one is definitely not complete (e.g. it treats (?<foo>...) and (?'foo'...) as non-capture groups).

Another way would be to dump the internal data structures of the regcomp function. Maybe the Regexp::Debugger package could solve the issue, but I am not allowed to install packages in my environment.

Actually the regexps are keys to some ARRAY refs, and I'd like to check that each referenced ARRAY contains the right number of values before actually applying the regexps. Of course this check can be done right after the pattern match, but it would be nicer to do it in the loading stage of the script.
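
One way such a load-time check might look, sketched with a different technique from the bracket counting above: perlvar documents that after a successful match `$#+` holds the number of capture groups in the pattern, so matching the empty string against the pattern plus an empty alternative should yield the count without touching real data. The names `group_count` and `mismatched_patterns` and the sample table are hypothetical, and patterns with embedded code blocks are not considered here.

```perl
use strict;
use warnings;

# Hypothetical helper: number of capture groups in a pattern given as a string.
# The trailing empty alternative lets the match succeed on '' no matter what
# the pattern itself requires; $#+ then reports the pattern's group count.
sub group_count {
    my ($pattern) = @_;
    '' =~ /(?:$pattern)|/ or die "empty alternative failed to match";
    return $#+;
}

# Hypothetical helper: keys whose array ref does not hold one value per group.
sub mismatched_patterns {
    my ($table) = @_;
    return grep { group_count($_) != @{ $table->{$_} } } sort keys %$table;
}

# Sample table, made up for illustration: pattern => one value per group.
my %table = (
    '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x' => [ 'first', 'second' ],
    '(a)(?:b)(c)'                        => [ 'only one value' ],
);

warn "group/value count mismatch for /$_/\n" for mismatched_patterns(\%table);
```

Because the pattern is compiled by Perl itself, named groups such as (?<foo>...) and (?'foo'...) are counted without any extra parsing.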

Thank you for your help and comments in advance!

TrueY asked Jan 19 '17 13:01





2 Answers

Regex:

\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))

Explanation:

\\.                     # Match any escaped character
(*SKIP)(?!)             # Discard it
|                       # OR
\(                      # Match a single `(`
(?(?=\?)                # If it is followed by `?`
    \?                      # Match `?`
    P?['<]\w+['>]           # Then require a named-group opener: ?P<name>, ?<name> or ?'name'
)                       # End of conditional statement

Perl:

my @offsets = ();
while ('XXAB(CD)DE\FG\X(X)' =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g){
    push @offsets, "$-[0]";
}
print join(", ", @offsets);

Output:

4, 15

These offsets show that the input string contains two capturing groups.
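
The same pattern can also be pointed at a regexp's source text instead of a target string, which is closer to what the question asks for. A minimal sketch, assuming the regexp is available as a plain string; `count_capture_groups` is a hypothetical name, and opening brackets inside character classes such as `[(]` or inside `(?#...)` comments would still be miscounted.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the opening brackets in a pattern's source text
# that start a numbered or named capture group, using the regex shown above.
sub count_capture_groups {
    my ($pattern) = @_;
    my $count = 0;
    # In scalar context each /g match resumes where the previous one ended.
    $count++ while $pattern =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g;
    return $count;
}

# The regexp from the question, stored as a string.
my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x';
print count_capture_groups($re), "\n";    # prints 2
```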

revo answered Nov 15 '22 03:11



Without any limiting requirements on the regexes that occur, I don't think there is a definitive answer for the number of capture groups. Just think of alternatives with differing capture group counts, and the possibility of this occurring again in each branch:

my $re = qr/ A(B)C | A(D|(E(G+|H)))F /x;

This regex can obviously have up to 3 capture groups. You could recursively parse each branch, and take the highest number as your result - but I honestly cannot come up with a practical way to do this in a short time. For 'linear' regexes not using alternatives or non-basic regex features, the task of determining the count of capture groups is possible, but I don't think it's feasible with more advanced ones.
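
How Perl itself reports the groups for a pattern of this shape can be checked with a small experiment. The sketch below assumes a bracket-balanced form of the pattern and a hypothetical helper `group_indexes`. It returns `$#-`, the index of the last group that took part in the match, and `$#+`, which perlvar describes as the number of groups in the pattern.

```perl
use strict;
use warnings;

# Assumed bracket-balanced form of the pattern discussed above.
my $re = qr/ A(B)C | A(D|(E(G+|H)))F /x;

# Hypothetical helper: ($#-, $#+) after a successful match, empty list otherwise.
sub group_indexes {
    my ($string) = @_;
    return unless $string =~ $re;
    return ($#-, $#+);
}

# One input per path through the alternation.
for my $string ('ABC', 'ADF', 'AEGGF') {
    my ($last_set, $total) = group_indexes($string);
    print "$string: last group set = $last_set, groups in pattern = $total\n";
}
```

In this experiment the second value should stay the same for every input while the first depends on the branch taken, which suggests that the size of @+ is fixed by the pattern and only @- varies with the match.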

SREagle answered Nov 15 '22 04:11
