Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl regular expression: match nested brackets

Tags:

regex

perl

I'm trying to match nested {} brackets with a regular expressions in Perl so that I can extract certain pieces of text from a file. This is what I have currently:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

foreach (@matches) {
    print "$_\n";
}

At certain times this works as expected. For instance, if $str = "abc {{xyz} abc} {xyz}" I obtain:

abc
{{xyz} abc}
{xyz}

as expected. But for other input strings it does not function as expected. For example, if $str = "{abc} {{xyz}} abc", the output is:

{abc} {{xyz}}
abc

which is not what I expected. I would have wanted {abc} and {{xyz}} to be on separate lines, since each is balanced on its own in terms of brackets. Is there an issue with my regular expression? If so, how would I go about fixing it?

like image 650
arshajii Avatar asked Mar 08 '13 19:03

arshajii


3 Answers

You were surprised how your pattern matched, but noone explained it? Here's how your pattern is matching:

my @matches = $str =~ /\{(?:\{.*\}|[^{])*\}|\w+/sg;
                       ^    ^ ^ ^  ^      ^
                       |    | | |  |      |
{ ---------------------+    | | |  |      |
a --------------------------)-)-)--+      |
b --------------------------)-)-)--+      |
c --------------------------)-)-)--+      |
} --------------------------)-)-)--+      |
  --------------------------)-)-)--+      |
{ --------------------------+ | |         |
{ ----------------------------+ |         |
x ----------------------------+ |         |
y ----------------------------+ |         |
z ----------------------------+ |         |
} ------------------------------+         |
} ----------------------------------------+

As you can see, the problem is that /\{.*\}/ matches too much. What should be in there is a something that matches

(?: \s* (?: \{ ... \} | \w+ ) )*

where the ... is

(?: \s* (?: \{ ... \} | \w+ ) )*

So you need some recursion. Named groups are an easy way of doing this.

say $1
   while /
      \G \s*+ ( (?&WORD) | (?&BRACKETED) )

      (?(DEFINE)
         (?<WORD>      \s* \w+ )
         (?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
         (?<TEXT>      (?: (?&WORD) | (?&BRACKETED) )+ )
      )
   /xg;

But instead of reinventing the wheel, why not use Text::Balanced.

like image 143
ikegami Avatar answered Sep 22 '22 08:09

ikegami


To match nested brackets with just one pair at each level of nesting,
but any number of levels, e.g. {1{2{3}}}, you could use

/\{[^}]*[^{]*\}|\w+/g

To match when there may be multiple pairs at any level of nesting, e.g. {1{2}{2}{2}}, you could use

/(?>\{(?:[^{}]*|(?R))*\})|\w+/g

The (?R) is used to match the whole pattern recursively.

To match the text contained within a pair of brackets the engine must match (?:[^{}]*|(?R))*,
i.e. either [^{}]* or (?R), zero or more times *.

So in e.g. "{abc {def}}", after the opening "{" is matched, the [^{}]* will match the "abc " and the (?R) will match the "{def}", then the closing "}" will be matched.

The "{def}" is matched because (?R) is simply short for the whole pattern
(?>\{(?:[^{}]*|(?R))*\})|\w+, which as we have just seen will match a "{" followed by text matching [^{}]*, followed by "}".

Atomic grouping (?>...) is used to prevent the regex engine backtracking into bracketed text once it has been matched. This is important to ensure the regex will fail fast if it cannot find a match.

like image 26
MikeM Avatar answered Sep 23 '22 08:09

MikeM


The problem of matching balanced and nested delimiters is covered in perlfaq5 and I'll leave it to them to cover all the options including (?PARNO) and Regexp::Common.

But matching balanced items is tricky and prone to error, unless you really want to learn and maintain advanced regexes, leave it to a module. Fortunately there is Text::Balanced to handle this and so very much more. It is the Swiss Army Chainsaw of balanced text matching.

Unfortunately it does not handle escaping on bracketed delimiters.

use v5.10;
use strict;
use warnings;

use Text::Balanced qw(extract_multiple extract_bracketed);

my @strings = ("abc {{xyz} abc} {xyz}", "{abc} {{xyz}} abc");

for my $string (@strings) {
    say "Extracting from $string";

    # Extract all the fields, rather than one at a time.
    my @fields = extract_multiple(
        $string,
        [
            # Extract {...}
            sub { extract_bracketed($_[0], '{}') },
            # Also extract any other non whitespace
            qr/\S+/
        ],
        # Return all the fields
        undef,
        # Throw out anything which does not match
        1
    );

    say join "\n", @fields;
    print "\n";
}

You can think of extract_multiple like a more generic and powerful split.

like image 27
Schwern Avatar answered Sep 24 '22 08:09

Schwern