
Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:

use strict;
use warnings;

my $text = << "END";
int max(int x, int y)
{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}
// more stuff to capture
END

# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
    (?<block>
        \{                # Match opening brace
            (?:           # Start non-capturing group
                [^{}]++   #     Match non-brace characters without backtracking
                |         #     or
                (?&block) #     Recurse into the named group "block"
            )*            # Match 0 or more times
        \}                # Match closing brace
    )
)/x;

# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
    print $1;
    print $2;
}

I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.

$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.

asked Sep 08 '17 15:09 by tjwrona1992


2 Answers

Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:

m/
  (.+?)  # group 1
  (?:  # the $code_block regex
    (?&block)
    (?(DEFINE)
      (?<block> ... )  # group 2
    )
  )
  (.+)  # group 3
/xs

Named groups receive a number as well, so they can also be accessed as numbered groups.
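
To illustrate with an unrelated pattern (the date string and the group name year are made up for this sketch), a named group is reachable both through %+ and through its position:

```perl
use strict;
use warnings;

my $date = '2017-09-08';    # hypothetical input, not from the question
my ($by_name, $by_number, $month);

if ($date =~ /(?<year>\d{4})-(\d\d)/) {
    $by_name   = $+{year};  # access by name
    $by_number = $1;        # the same group, accessed by its number
    $month     = $2;        # the unnamed group is numbered after it
}

print "year by name:   $by_name\n";
print "year by number: $by_number\n";
print "month:          $month\n";
```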

The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.

As a consequence, the text after the code-block will be stored in capture $3.
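
This can be checked with a small sketch. It assumes a condensed version of your $code_block regex and a shortened, made-up input string; the names $before, $block and $after are only illustrative:

```perl
use strict;
use warnings;

# Condensed version of the $code_block regex from the question
my $code_block = qr/(?&block)
(?(DEFINE)
    (?<block> \{ (?: [^{}]++ | (?&block) )* \} )
)/x;

# Shortened stand-in for the C snippet (an assumption for this sketch)
my $text = "int f()\n{ if (x) { return 1; } }\n// trailing comment\n";

# In list context the match returns groups 1, 2 and 3 in order
my ($before, $block, $after) = $text =~ m/(.+?)$code_block(.+)/s;

print defined $block ? "group 2 is set\n" : "group 2 is undef\n";
print "group 3 holds: $after";
```

Group 2 belongs to (?<block>...) inside the DEFINE, so the trailing text only shows up in group 3.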

There are two ways to deal with this problem:

  • For complex regexes, use only named captures. Consider a regex complex as soon as you assemble it from regex objects, or if captures are conditional. For example:

    if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
        print $+{before};
        print $+{afterwards};
    }
    
  • Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
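
A minimal sketch of the second option, assuming a shortened version of your sample function. The names $block_def, $before and $after are made up for this illustration. The regex object only defines the named pattern, and the main regex calls it with (?&block) and interpolates the definitions last:

```perl
use strict;
use warnings;

my $text = <<'END';
int max(int x, int y)
{
    if (x > y) { return x; }
    else       { return y; }
}
// more stuff to capture
END

# Only defines the named pattern; it matches nothing by itself
my $block_def = qr/
    (?(DEFINE)
        (?<block>
            \{                           # opening brace
                (?: [^{}]++ | (?&block) )*   # non-brace text or a nested block
            \}                           # closing brace
        )
    )
/x;

# The definitions come last, so the two real captures are groups 1 and 2
my ($before, $after) = $text =~ m/(.+?)(?&block)(.+)$block_def/s
    or die "no match\n";

print "before: $before";
print "after: $after";
```

With the DEFINE at the end, (?<block>...) becomes group 3 and no longer shifts the numbering of the captures you care about.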

answered Sep 19 '22 10:09 by amon


There are also ready-made tools that can do this in a few lines of code.

Perhaps the first module to look at is the core Text::Balanced.

In list context, extract_bracketed returns the matched substring, the remainder of the string after the match, and the substring before the match. We can then keep matching in the remainder:

use warnings;
use strict;
use feature 'say';

use Text::Balanced qw/extract_bracketed/;

my $text = 'start {some {stuff} one} and {more {of it} two}, and done';

my ($match, $lead);
while (1) {
    ($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
    say $lead // $text;
    last if not defined $match; 
}

which prints

start 
 and 
, and done

Once there is no match, $lead is undefined as well, so we need to print the remainder instead, thus $lead // $text. The code uses $text directly and reassigns it on every pass, down to the last remainder; if you'd like to keep the original text, save a copy first.

I've used a made-up string above, but I tested it on your code sample as well.
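
For a single block, as in your sample, one call is enough. This is a sketch under the assumption that the input is a shortened version of your function; the names $source, $block, $after and $before are made up here:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $source = <<'END';
int max(int x, int y)
{
    if (x > y) { return x; }
    else       { return y; }
}
// more stuff to capture
END

# List context returns: the bracketed match, the remainder, the skipped prefix.
# $source itself is left untouched because the result goes to new variables.
my ($block, $after, $before) = extract_bracketed($source, '{', '[^{]*');

print "before: $before";
print "after: $after";
```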


This can also be done using Regexp::Common.

Split the string on its $RE{balanced} regex, then take the odd-indexed elements:

use Regexp::Common qw(balanced);

my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;

my @out_of_blocks = @parts[  grep { $_ & 1 } 1..$#parts ];

say for @out_of_blocks;

If the string starts with the delimiter the first element is an empty string, as usual with split.

To clean out leading and trailing spaces, pass it through map { s/^\s+|\s+$//gr }.
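
A small sketch of that trimming step on its own, using a made-up list in place of the split result. It relies on the /r modifier, which needs Perl 5.14 or newer:

```perl
use strict;
use warnings;

# Hypothetical pieces, standing in for the text found outside the blocks
my @pieces = ("start ", " and ", ", and done\n");

# /r returns the modified copy instead of changing the element in place
my @trimmed = map { s/^\s+|\s+$//gr } @pieces;

print "[$_]\n" for @trimmed;
```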

answered Sep 19 '22 10:09 by zdim