
Regex for matching indented continuation lines

Tags: regex, perl

I need to match key = value pairs in arbitrary text using the following rules.

  • the leading line has this structure:
    • it starts with indentation, "two spaces or one tab", at least once, e.g.: (  |\t)+
    • then the + character and one space
    • then the word VAR or CONST
    • then the key and the value, separated by the = character

Examples:

  + VAR somename = somevalue (indented with two spaces)
        + VAR name3 = indented by one \t

The following regex matches such lines:

/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/
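
As a quick check, here is a minimal sketch that runs this regex over a few lines. The @lines sample data and the output format are only an illustration and are not part of the real input.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Leading-line pattern from above: indentation, '+', VAR|CONST, key, '=', value.
my $lead = qr/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/;

# Illustrative sample lines (assumed, not taken from the real input).
my @lines = (
    "  + VAR somename = somevalue",          # two spaces: matches
    "\t+ VAR name3 = indented by one tab",   # one tab: matches
    " + VAR bad = only one space",           # a single space: no match
);

my @found;
for my $line (@lines) {
    next unless $line =~ $lead;
    push @found, { type => $2, key => $3, val => $4 };
    print "$2 $3 => [$4]\n";
}
```

Only the first two lines are reported, because the third is indented by a single space.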

Now the problem: the syntax allows continuation lines. When a leading line is followed by a line that starts with at least one indentation sequence (  |\t) (i.e. TWO spaces or one tab) and whose first character after the indentation is not +, that line is a continuation line, and its whole content (including the leading spaces) should be appended to the value of the key on the previous line.

Example:

  + VAR multi = 3 line value where the continuation lines
  are indented (starts with two spaces or one tab)
  and NOT followed by the '+'

i.e., the regex for the continuation line is

/^(  |\t)+([^\+](.*))$/
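
A small sketch of what this pattern accepts and rejects. The sample lines in %expect are assumptions made for the demonstration. Note the last entry: a line indented by four spaces and then a + still matches, because the group can back off to a single two-space unit and [^\+] then matches a space. In a line-based parser this is probably harmless as long as the leading-line test runs first.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Continuation-line pattern from above.
my $cont = qr/^(  |\t)+([^\+](.*))$/;

# Illustrative sample lines (assumed) with the result each should give.
my %expect = (
    "  are indented"     => 1,   # two spaces, then text: continuation
    "\tand tab-indented" => 1,   # one tab, then text: continuation
    "  + VAR x = 1"      => 0,   # '+' right after the indentation: rejected
    "not indented"       => 0,   # no indentation: rejected
    "    + VAR y = 2"    => 1,   # caveat: four spaces, then '+', still matches
);

my %got;
for my $line ( sort keys %expect ) {
    $got{$line} = $line =~ $cont ? 1 : 0;
    printf "%-22s %s\n", "[$line]", $got{$line} ? 'continuation' : 'no match';
}
```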

The solution is easy with a line-based approach, e.g. when I split the whole text into lines and process it line by line.

However, I am looking for a single (complex) regex, mainly for learning and benchmarking purposes, that matches the key = value pairs in both single-line and multi-line form. I tried this:

while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
    ...
}

but I got:

(?=(  |\t)+[^\+](.*)$)* matches null string many times in regex; marked by <-- HERE in m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)* <-- HERE )/ at so line 36.

Side question: how can I use multi-line extended regexes, like:

/
   ^(  |\t)+      # <- space ... :(
   \+\s+
   (VAR|CONST)
   \s+
   (\w+)
   \s*=\s*
   (.*)$
/x

when the regex must contain a literal SPACE character (i.e. I can't use the generic \s)?

If someone wants to help, here is code that produces the wanted output (using the line-based approach), together with the non-working regex-based solution.

#!/usr/bin/env perl
use 5.014;
use warnings;
use Data::Dumper;

my $txt = do { local $/; <DATA> };

my @matches1 = parse_by_lines($txt // '');
mydump('BY LINES', @matches1);

my @matches2 = parse_by_one_regex($txt // '');
mydump('REGEX', @matches2);

sub parse_by_lines { #produces the wanted output
    my ($text) = @_;
    my @match;
    my $havekey;
    for my $line (split "\n", $text) {
        if( $line =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/ ) {
            push @match, { indent => $1, type => $2, key => $3, val => $4 };
            $havekey++;
        }
        elsif( $havekey && $line =~ m/^(  |\t)+([^\+](.*))$/ ) {    #continuation line
            $match[-1]->{val} .= "\n$line"; #preserve the \n in the val
        }
        else {
            $havekey = 0;
        }
    }
    return @match;
}


sub parse_by_one_regex { #not working
    my ($text) = @_;
    my @match;
    while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
        push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
    return @match;
}

sub mydump {
    my($label, @match) = @_;
    say "#### $label ####";
    for my $m ( @match ) {
        printf "%-6s: [%s]\n", $_, $m->{$_} for (qw(indent type key val));
        print "\n";
    }
}

__DATA__
some arbitrary text lines
or empty lines

    could be indented
  and could contain any character

  + VAR name1 = var indented by two spaces and the first nonspace character is '+'
line of arbitrary text
    + VAR name2 = var indented by 2x2 spaces

    + VAR name3 = var indented by one \t
  + VAR name4 = the next line with "name5" is not valid. missing the = character, should not be matched
  + VAR name5
  + CONST name6 = the type could be VAR or CONST

  + VAR multi1 = multiline value where the continuation lines
  are indented (starts with two spaces or one tab) and NOT followed by the '+'

  + VAR multi1 = multiline value
    indented

  + VAR multi1 = multiline value
     indented ok too


  + VAR single = this is single line
  + because this line even if it is indented, the first nonspace character is '+'

  + VAR multi2 = multiline
  could be
     indented
        any way
  and any number of times
  until the first non-indented line

the following should NOT match

+ VAR some = should not be matched, because the line isn't indented
 + VAR some = should not be matched, because the line isn't indented at least with TWO spaces or one tab
  + SOME name = value not matched because the SOME isn't VAR or CONST

EDIT: Using the accepted answer and adding the capture groups I wanted, I got the following:

    while( $text =~ /
            (?m)            # multiline match
            ^               # at the start of the line
            ([ ]{2}|\t)+    # two spaces or tab - at least once
            \+              # the '+' character
            \s*             # followed by any number of spaces (e.g. "+VAR" or "+    VAR" are valid)
            (VAR|CONST)     # the VAR or CONST
            \s+             # followed by at least one space (e.g. "VAR_" should not be matched)
            (\w+)           # the keyword
            \s*=\s*         # the '=' surrounded (and consumed) by any number of spaces
            (               # capture the whole value (as it is)
                    .*                      # any string up to end of line
                    (?:                     # followed by (non-capturing group)
                            \R              # one line-break
                            ^               # at the start of the line
                            (?>[ ]{2,}|\t+) # atomic group - at least two spaces or at least one tab
                            [^+]            # followed by any character but '+'
                            .*              # any string up to the end of line
                    )*              # any number of times (e.g. optionally)
            )
    /xg) {
            push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
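
For anyone who wants to run it, here is a self-contained sketch of the same loop in compact form. The sample text in $text is a small assumed input, not the full data from the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Small assumed sample: one single-line pair, one multi-line pair,
# and a third pair that ends the multi-line value.
my $text = join "\n",
    "some arbitrary text",
    "  + VAR name1 = single line",
    "not indented, ends the value",
    "  + CONST multi = first",
    "    second",
    "\tthird",
    "  + VAR next = starts a new pair",
    "";

my @match;
while ( $text =~ /
        (?m) ^ ([ ]{2}|\t)+ \+ \s* (VAR|CONST) \s+ (\w+) \s*=\s*
        (                                         # the whole value
            .*
            (?: \R ^ (?>[ ]{2,}|\t+) [^+] .* )*   # continuation lines
        )
    /xg ) {
    push @match, { type => $2, key => $3, val => $4 };
}

printf "%s %s => [%s]\n", @{$_}{qw(type key val)} for @match;
```

With this input the value of multi keeps both continuation lines, including their indentation and the line breaks between them.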

EDIT 2: And yes, the regex-based solution is 34% faster (at least on my hardware).
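
A comparison like this could be reproduced with the core Benchmark module. The sketch below is only an assumed setup: the sample text, its size, and the simplified parser bodies (by_lines and by_regex) are made up for illustration, so the figures it prints will differ from the 34% quoted above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample input: the same small block repeated 200 times.
my $text = join( "\n",
    "arbitrary text",
    "  + VAR multi = first",
    "    second",
    "  + CONST single = value",
    "",
) x 200;

my $lead = qr/^(?:  |\t)+\+\s+(?:VAR|CONST)\s+(\w+)\s*=\s*(.*)$/;
my $cont = qr/^(?:  |\t)+[^+].*$/;

# Line-based parser: returns the number of pairs found.
sub by_lines {
    my ( @match, $havekey );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ $lead ) {
            push @match, { key => $1, val => $2 };
            $havekey = 1;
        }
        elsif ( $havekey && $line =~ $cont ) {
            $match[-1]{val} .= "\n$line";
        }
        else {
            $havekey = 0;
        }
    }
    return scalar @match;
}

# Single-regex parser: returns the number of pairs found.
sub by_regex {
    my @match;
    while ( $text =~ /^(?:[ ]{2}|\t)+\+\s*(?:VAR|CONST)\s+(\w+)\s*=\s*(.*(?:\R^(?>[ ]{2,}|\t+)[^+].*)*)/mg ) {
        push @match, { key => $1, val => $2 };
    }
    return scalar @match;
}

# Run each parser for at least one CPU second and print a comparison table.
cmpthese( -1, { lines => \&by_lines, regex => \&by_regex } );
```

A negative count tells cmpthese to run each sub for at least that many CPU seconds, which tends to give steadier numbers than a fixed iteration count on small inputs.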

asked Sep 16 '16 at 10:09 by cajwine


1 Answer

Regex:

(?m)^(?:  +|\t+)\+ *(?:VAR|CONST) *\w+ *=.*(?:\R^(?>  +|\t+)[^+\s].*)*


The important part is the last cluster:

(?:                # Start of non-capturing group (a)
    \R             # One line-break
    ^              # Start of line
    (?>  +|\t+)    # At least two spaces or one tab character (atomic group, no backtracking)
    [^+\s]         # Next character is neither `+` nor whitespace
    .*             # Up to end of line
)*                 # Repeat it as much as possible - end of non-capturing group (a)
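
To see what the atomic group contributes, the sketch below compares an ordinary group with an atomic one on a line that is indented by four spaces and then starts with +. The sample line and the stripped-down patterns are assumptions made for the demonstration. The first two use [^+] rather than [^+\s] so that the effect of (?>...) is visible on its own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed sample: a new key line indented deeper than the minimum.
my $line = "    + VAR deeper = value";

# Ordinary group: '  +' first grabs all four spaces and [^+] fails on '+';
# the engine then gives one space back and [^+] matches that space.
my $plain = $line =~ /^(?:  +|\t+)[^+].*/ ? 1 : 0;

# Atomic group: the spaces are never given back, so the match fails.
my $atomic = $line =~ /^(?>  +|\t+)[^+].*/ ? 1 : 0;

# The class [^+\s] also excludes whitespace, which rejects the line
# even without the atomic group.
my $strict = $line =~ /^(?:  +|\t+)[^+\s].*/ ? 1 : 0;

print "plain=$plain atomic=$atomic strict=$strict\n";
# prints "plain=1 atomic=0 strict=0"
```

So either safeguard alone keeps a deeper-indented key line from being swallowed as a continuation line, and the regex above uses both.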

Answer to your second question:

When the x modifier is set, literal space characters in the pattern are ignored. To match a real space, enclose it in a character class, [ ], and add a quantifier such as [ ]{2,} to say how many times it should appear.

/
    (?m)
    ^
    (?:
        [ ]{2,}
        |
        \t+
    )\+
    [ ]*
    (?:
        VAR
        |
        CONST
    )
    [ ]*\w+[ ]*=.*
    (?:
        \R
        ^
        (?>
            [ ]{2,}
            |
            \t+
        )
        [^+\s].*
    )*
/x
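
Besides a character class, a literal space under x can also be written as a backslash-escaped space or as the hex escape \x20. The sketch below compares these spellings on an assumed sample line. The hash keys are just labels chosen for the demonstration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = "  + VAR key = value";    # assumed sample, indented by two spaces

my %re = (
    class   => qr/ ^ [ ]{2}   \+ /x,    # space inside a character class
    escaped => qr/ ^ \ \      \+ /x,    # two backslash-escaped spaces
    hex     => qr/ ^ \x20{2}  \+ /x,    # hex escape for the space character
    ignored => qr/ ^          \+ /x,    # unescaped spaces are dropped
);

my %hit = map { $_ => ( $line =~ $re{$_} ? 1 : 0 ) } keys %re;
print "$_: $hit{$_}\n" for sort keys %hit;
```

The first three patterns match the line. The last one does not, because with the spaces ignored it reduces to ^\+, which requires the line to start with +.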


answered Nov 07 '22 at 06:11 by revo