I'm trying to match nested {} brackets with a regular expression in Perl so that I can extract certain pieces of text from a file. This is what I have currently:
my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;
foreach (@matches) {
    print "$_\n";
}
At certain times this works as expected. For instance, if $str = "abc {{xyz} abc} {xyz}", I obtain:
abc
{{xyz} abc}
{xyz}
as expected. But for other input strings it does not work as expected. For example, if $str = "{abc} {{xyz}} abc", the output is:
{abc} {{xyz}}
abc
which is not what I expected. I would have wanted {abc} and {{xyz}} to be on separate lines, since each is balanced on its own in terms of brackets. Is there an issue with my regular expression? If so, how would I go about fixing it?
You were surprised by how your pattern matched, but no one explained it? Here's how your pattern is matching:
my @matches = $str =~ /\{(?:\{.*\}|[^{])*\}|\w+/sg;
                       ^    ^ ^ ^  ^      ^
                       |    | | |  |      |
{ ---------------------+    | | |  |      |
a --------------------------)-)-)--+      |
b --------------------------)-)-)--+      |
c --------------------------)-)-)--+      |
} --------------------------)-)-)--+      |
  --------------------------)-)-)--+      |
{ --------------------------+ | |         |
{ ----------------------------+ |         |
x ----------------------------+ |         |
y ----------------------------+ |         |
z ----------------------------+ |         |
} ------------------------------+         |
} ----------------------------------------+
As you can see, the problem is that /\{.*\}/ matches too much. What should be in there is something that matches
(?: \s* (?: \{ ... \} | \w+ ) )*
where the ... is
(?: \s* (?: \{ ... \} | \w+ ) )*
So you need some recursion. Named groups are an easy way of doing this.
say $1
    while /
        \G \s*+ ( (?&WORD) | (?&BRACKETED) )
        (?(DEFINE)
            (?<WORD>      \s* \w+ )
            (?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
            (?<TEXT>      (?: (?&WORD) | (?&BRACKETED) )+ )
        )
    /xg;
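To try the named-group pattern on the two strings from the question, it can be wrapped in a small helper. This is only a sketch: the sub name top_level_items is made up for illustration, and the pattern is applied to a named variable rather than $_. It assumes Perl 5.10 or later, which is needed for recursive patterns and say.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): collects each
# top-level word or balanced {...} group using the recursive pattern.
sub top_level_items {
    my ($str) = @_;
    my @items;
    while (
        $str =~ /
            \G \s*+ ( (?&WORD) | (?&BRACKETED) )
            (?(DEFINE)
                (?<WORD>      \s* \w+ )
                (?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
                (?<TEXT>      (?: (?&WORD) | (?&BRACKETED) )+ )
            )
        /xg
    ) {
        push @items, $1;
    }
    return @items;
}

for my $str ("abc {{xyz} abc} {xyz}", "{abc} {{xyz}} abc") {
    say for top_level_items($str);
    print "\n";
}
```

For the second string this should print {abc}, {{xyz}} and abc on separate lines. Because of the \G anchor, the loop stops at the first position where neither a word nor a balanced group can be matched, for example at a stray closing bracket.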
But instead of reinventing the wheel, why not use Text::Balanced?
To match nested brackets with just one pair at each level of nesting, but any number of levels, e.g. {1{2{3}}}, you could use
/\{[^}]*[^{]*\}|\w+/g
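A quick way to see where this simpler pattern stops being reliable is to run it on an input with sibling pairs at one level. The sketch below is illustrative only; the sub name simple_matches is an assumption, not something from the answer.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: returns every match of the non-recursive pattern.
# Must be called in list context to get the list of matches.
sub simple_matches {
    my ($str) = @_;
    return $str =~ /\{[^}]*[^{]*\}|\w+/g;
}

# One pair per level: the whole string is a single match.
say join ' | ', simple_matches('{1{2{3}}}');       # {1{2{3}}}

# Several pairs at one level: the matches are no longer balanced.
say join ' | ', simple_matches('{1{2}{2}{2}}');    # {1{2} | {2} | {2}}
```

The second call returns unbalanced pieces, because the pattern does no real nesting bookkeeping. It only looks for a "{", some text, and a later "}".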
To match when there may be multiple pairs at any level of nesting, e.g. {1{2}{2}{2}}, you could use
/(?>\{(?:[^{}]*|(?R))*\})|\w+/g
The (?R) is used to match the whole pattern recursively.
To match the text contained within a pair of brackets the engine must match (?:[^{}]*|(?R))*, i.e. either [^{}]* or (?R), zero or more times (*).
So in e.g. "{abc {def}}", after the opening "{" is matched, the [^{}]* will match the "abc " and the (?R) will match the "{def}", then the closing "}" will be matched.
The "{def}" is matched because (?R) is simply short for the whole pattern (?>\{(?:[^{}]*|(?R))*\})|\w+, which as we have just seen will match a "{" followed by text matching [^{}]*, followed by "}".
Atomic grouping (?>...) is used to prevent the regex engine backtracking into bracketed text once it has been matched. This is important to ensure the regex will fail fast if it cannot find a match.
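As a rough check, the (?R) pattern can be run against the problem string from the question and against the multi-pair example. This is a minimal sketch assuming Perl 5.10 or later; the sub name balanced_matches is hypothetical.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: returns every match of the recursive pattern.
# The pattern is written inline so that (?R) refers to exactly this regex.
sub balanced_matches {
    my ($str) = @_;
    return $str =~ /(?>\{(?:[^{}]*|(?R))*\})|\w+/g;
}

say join ' | ', balanced_matches('{abc} {{xyz}} abc');   # {abc} | {{xyz}} | abc
say join ' | ', balanced_matches('{1{2}{2}{2}}');        # {1{2}{2}{2}}
```

Each top-level balanced group comes back as its own match, which is the behaviour the question asked for.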
The problem of matching balanced and nested delimiters is covered in perlfaq6 and I'll leave it to them to cover all the options including (?PARNO) and Regexp::Common.
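For orientation only, here is a minimal sketch of what the (?PARNO) form can look like, assuming Perl 5.10 or later. (?1) recurses into capture group 1 rather than the whole pattern; the sub name bracketed_groups is made up for this example, and it returns only the bracketed groups, not the bare words.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: returns each top-level balanced {...} group.
sub bracketed_groups {
    my ($str) = @_;
    # (?1) re-enters capture group 1, so nested braces are matched recursively.
    return $str =~ /( \{ (?: [^{}]++ | (?1) )* \} )/xg;
}

say join ' | ', bracketed_groups('abc {{xyz} abc} {xyz}');   # {{xyz} abc} | {xyz}
```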
But matching balanced items is tricky and prone to error. Unless you really want to learn and maintain advanced regexes, leave it to a module. Fortunately there is Text::Balanced to handle this and so very much more. It is the Swiss Army Chainsaw of balanced text matching.
Unfortunately it does not handle escaping on bracketed delimiters.
use v5.10;
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);
my @strings = ("abc {{xyz} abc} {xyz}", "{abc} {{xyz}} abc");
for my $string (@strings) {
    say "Extracting from $string";
    # Extract all the fields, rather than one at a time.
    my @fields = extract_multiple(
        $string,
        [
            # Extract {...}
            sub { extract_bracketed($_[0], '{}') },
            # Also extract any other non whitespace
            qr/\S+/
        ],
        # Return all the fields
        undef,
        # Throw out anything which does not match
        1
    );
    say join "\n", @fields;
    print "\n";
}
You can think of extract_multiple like a more generic and powerful split.