I'm trying to write a recursive regular expression to capture code blocks, but for some reason it seems to not be capturing them properly. I would expect the code below to capture the full body of the function, but instead it only captures the contents of the first if
statement.
It's almost like the .+?
is somehow gobbling up the first {
, but it's supposed to be non-greedy so I don't understand why it would.
What is causing it to act this way?
Script:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
END
# Regular expression to capture balanced "{}" groups
my $regex = qr/
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?R) # Recursively match the entire expression
)* # Match 0 or more times
\} # Match closing brace
/x;
# is ".+?" gobbling up the first "{"?
# What would cause it to do this?
if ($text =~ m/int\s.+?($regex)/s){
print $1;
}
Output:
{
return x;
}
Expected Output:
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
I know that there is a Text::Balanced
module for this purpose, but I am attempting to do this by hand in order to learn more about regular expressions.
(?R)
recurses into the whole pattern – but what is the whole pattern? When you embed the quoted $regex
into /int\s.+?($regex)/
, the pattern is recompiled and (?R)
refers to the new pattern. That's not what you intended.
I'd recommend you use named captures instead so that you can recurse by name. Change the $regex
like
/(?<nestedbrace> ... (?&nestedbrace) ...)/
If you want to avoid extra captures, you can use the (?(DEFINE) ...)
syntax to declare named regex patterns that can be called later:
my $define_nestedbrace_re = qr/(?(DEFINE)
(?<nestedbrace ... (?&nestedbrace) ...)
)/x;
Then: /int\s.+?((?&nestedbrace))$define_nestedbrace_re/
That won't create additional captures. However, it is not generally possible to write encapsulated regex fragments. Techniques like preferring named captures instead of numbered captures can help here.
You can change your recursive pattern to this one:
/int\s+.*? (
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace chars without backtracking
| # OR
(?-1) # Recursively match the previous group
)* # Match 0 or more times
\}
)/sx
(?-1)
instead of (?R)
that recurses whole matched pattern.(?-1)
is back-reference of previous capturing group.Updated RegEx Demo
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With