
perl regex for extracting multiline blocks

Tags: regex, perl

I have text like this:

00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have

So there is no explicit block end; a block ends where the next block starts.

I want to extract all blocks, one after another:

1 = 00:00 stuff
2 = 00:01 more stuff
multi line
  and going

etc

The code below only gives me this:

$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';

What am I doing wrong?

use Data::Dumper;

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still 
have
    ';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);
Asked by cristi on May 14 '12 at 12:05


2 Answers

Perl 5.10.0 introduced named capture groups, which are useful for matching nontrivial patterns. The perlre documentation describes them as follows:

(?'NAME'pattern)
(?<NAME>pattern)

A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.

If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.

The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
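
To make the quoted description concrete, here is a minimal sketch of a named capture on a single line. The variable `$line` and the group names `time` and `desc` are chosen for illustration only; they are not required by Perl.

```perl
use strict;
use warnings;
use 5.010;  # named capture groups need Perl 5.10 or later

# Hypothetical one-line record, just for this illustration.
my $line = '00:01 more stuff';

# (?<time>...) and (?<desc>...) are named capture groups. After a
# successful match their contents are available through the %+ hash.
my ($time, $desc);
if ($line =~ /^(?<time>[0-9]{2}:[0-9]{2})[ \t]+(?<desc>.*)/) {
    ($time, $desc) = ($+{time}, $+{desc});
    print "time=$time desc=$desc\n";
}
```

Reading `$+{time}` instead of `$1` keeps the code readable when the pattern later gains or loses groups.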

Named groups can also be declared inside a (?(DEFINE)...) block and invoked by name with (?&NAME), which lets us name the subpatterns of a regex as in the following.

use 5.10.0;  # named capture buffers

my $block_pattern = qr/
  (?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))

  (?(DEFINE)
    # timestamp at logical beginning-of-line
    (?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])

    # runs of spaces or tabs
    (?<_sp> [ \t]+)

    # description is everything through the end of the record
    (?<_desc>
      # s switch makes . match newline too
      (?s: .+?)

      # terminate before optional whitespace (which we remove) followed
      # by either end-of-string or the start of another block
      (?= (?&_sp)? (?: $ | (?&_time)))
    )
  )
/x;

Use it like this:

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still
have
    ';

while ($text =~ /$block_pattern/g) {
  print "time=[$+{time}]\n",
        "desc=[[[\n",
        $+{desc},
        "]]]\n\n";
}

Output:

$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]

time=[00:01]
desc=[[[
more stuff
multi line
 and going
]]]

time=[00:02]
desc=[[[
still
have
]]]
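
As for why the pattern in the question captures only empty strings: a lazy `.*?` with nothing after it is already satisfied by matching zero characters, so the second group is always empty. The sketch below shows one possible minimal fix, which gives the lazy group a lookahead to stop at. The names `@broken` and `@fixed` are illustrative, and the sample text is a simplified copy of the question's input.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\n and going\n00:02 still\nhave\n";

# Lazy .*? followed by nothing: the shortest possible match is empty,
# so every second element of the result is ''.
my @broken = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;

# Same pattern, but the lazy group must now run up to the next
# timestamp at the start of a line, or to the end of the string (\z).
my @fixed = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)(?=^[0-9]{2}:[0-9]{2}|\z)/gms;

print "broken: [$_]\n" for @broken;
print "fixed:  [$_]\n" for @fixed;
```

Note that `\z` is used rather than `$`, because under `/m` a `$` matches at the end of every line and would cut each block off after its first line. The captured descriptions keep their leading space and trailing newline; trim them afterwards if that matters.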
Answered by Greg Bacon on Oct 29 '22 at 05:10


This should do the trick. The start of the next \d\d:\d\d timestamp is treated as the end of the current block.

use strict;
use warnings;

my $Str = '00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have
00:03 still 
    have' ;

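# Capture each timestamp plus everything after it (lazily; /s lets . cross
# newlines) up to the next timestamp or the end of the string.
my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);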

print join "--\n", @Blocks;
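
A different technique that relies on the same "next timestamp ends the block" idea is to split the text instead of matching it. This is a sketch of an alternative, not the approach used above; it assumes a timestamp only counts when it appears at the start of a line, and the name `@blocks` is illustrative.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\n  and going\n00:02 still\n    have\n";

# Split at every position where a line begins with a timestamp.
# The lookahead consumes nothing, so each timestamp stays in its block,
# and the zero-width match at the very start yields no empty field.
my @blocks = split /^(?=\d\d:\d\d)/m, $text;

print "$_--\n" for @blocks;
```

Because the pattern is anchored with `^` under `/m`, a time such as 12:30 mentioned in the middle of a description line does not start a new block.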
Answered by tuxuday on Oct 29 '22 at 03:10