I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recursively match the last captured group
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine. $2 should contain the comment below the block of code, but it doesn't, and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups. The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef. As a consequence, the text after the code block will be stored in capture $3.
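To see the numbering in action, here is a minimal sketch that reuses the pattern from the question. The sample text is condensed, and the variable names $before, $unused and $after are only illustrative. It checks that group 2 stays undefined while the trailing text lands in group 3:

```perl
use strict;
use warnings;

# Condensed version of the sample text from the question
my $text = <<'END';
int max(int x, int y)
{
    if (x > y) { return x; }
    else       { return y; }
}
// more stuff to capture
END

# Same recursive pattern as in the question. Once interpolated after
# (.+?), the named group (?<block>...) is also numbered group 2.
my $code_block = qr/(?&block)
    (?(DEFINE)
        (?<block>
            \{
            (?: [^{}]++ | (?&block) )*
            \}
        )
    )/x;

my ($before, $unused, $after);
if ($text =~ m/(.+?)$code_block(.+)/s) {
    ($before, $unused, $after) = ($1, $2, $3);
}

print "before: $before";
print "group 2 is ", (defined $unused ? "defined" : "undef"), "\n";
print "after: $after";
```

This matches what perlre says about recursion: capture groups matched inside a recursion are not accessible after the recursion returns, so the DEFINE'd group takes up a number without ever holding a value.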
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
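The second option might look like the sketch below. The name $define_block is made up for illustration, and the sample text is a condensed version of the one in the question. The regex object only contains the DEFINE group, so it matches the empty string wherever it is interpolated, and the block itself is invoked explicitly with (?&block):

```perl
use strict;
use warnings;

# Condensed version of the sample text from the question
my $text = <<'END';
int max(int x, int y)
{
    if (x > y) { return x; }
    else       { return y; }
}
// more stuff to capture
END

# Hypothetical name: this object only defines (?<block>...),
# it does not consume any text by itself.
my $define_block = qr/
    (?(DEFINE)
        (?<block>
            \{
            (?: [^{}]++ | (?&block) )*
            \}
        )
    )/x;

my ($before, $after);
# The definition is interpolated last, so the two real captures
# keep the numbers 1 and 2; the block group becomes number 3.
if ($text =~ m/(.+?)(?&block)(.+)$define_block/s) {
    ($before, $after) = ($1, $2);
}

print "before: $before";
print "after: $after";
```

With this layout $2 holds the text after the code block, because the group that defines block is only numbered after both captures.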
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
In list context, extract_bracketed returns the matched substring, the remainder of the string after the match, and the substring before the match. We can then keep matching in the remainder:
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
say $lead // $text;
last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we still need to print the remainder, hence $lead // $text ($lead is also undefined when the match fails). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text, save a copy first.
I've used a made-up string above, but I tested it on your code sample as well.
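For reference, here is a sketch of the same loop run on a condensed version of the sample from the question. It collects the pieces into an array (the name @outside is only illustrative) instead of printing them right away. Like the loop above, it assumes that a failed extract_bracketed call returns undef for the match and the untouched input as the remainder:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Condensed version of the sample text from the question
my $text = <<'END';
int max(int x, int y)
{
    if (x > y) { return x; }
    else       { return y; }
}
// more stuff to capture
END

my @outside;   # illustrative name: text found outside any {...} block
while (1) {
    my ($match, $rest, $lead) = extract_bracketed($text, '{', '[^{]*');
    # On success $lead is the text skipped before the block; on failure
    # it is undef and $rest is what is left of the input.
    push @outside, $lead // $rest;
    last if not defined $match;
    $text = $rest;
}

print "[$_]\n" for @outside;
```

The first element should be the function signature and the last one the comment that follows the block, each with its surrounding newlines.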
This can also be done using Regexp::Common. Break the string using its $RE{balanced} regex, then take odd elements:
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter, the first element is an empty string, as usual with split.
To clean out leading and trailing spaces, pass it through map { s/^\s+|\s+$//gr } (the /r modifier needs Perl 5.14 or later).