
Perl: Regex to get all text between repeating patterns

I would like to create a regex for the following.

I have some text like the following:

field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";

.... keeps repeating

Basically I'm trying to create a regex that would get all text from the start of the first "field =" to the start of the second "field = ". It has to skip past the field text used in the function call.

I currently have the following:

my @overall = ($string =~ m/field\s*=.*?/gis);

However, this just obtains the text "field = ". Without the "?" it gets all the data from the first all the way to the very last instance.

I also tried:

my @overall = ($string =~ m/field\s*=.*field\s*=/gis);

However, that only gets me every other instance, because each match consumes the second "field =" string and the next search starts after it. Any suggestions?

Coco asked Oct 26 '15 21:10


3 Answers

The easiest way I can see to do this is to split the $string by the /^\s*field\s*=/ expression. If we want to capture the 'field = ' portion of the text, we can break on a look-ahead:

foreach ( split /(?=^\s*field\s*=)/ms, $string ) {
    say "\$_=[\n$_]";
}

Thus, it breaks at the start of every line where 'field' is the next non-whitespace string, followed by any amount of whitespace, followed by a '='.

The output is:

$_=[
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";
]
$_=[

]
$_=[
field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";
]
$_=[

]
$_=[
field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";

.... keeps repeating
]

Almost what I wanted. But, it leaves an artifact of a blank line that occurs between the captures we do want. I'm not sure how to get rid of it, so we'll just filter out all-whitespace strings:

foreach ( grep { m/\S/ } split /(?=^\s*field\s*=)/ms, $string ) {
    say "\$_=[\n$_]";
}

And then it yields:

$_=[
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";
]
$_=[
field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";
]
$_=[
field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";

.... keeps repeating
]

Which you can work with.
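The blank chunk most likely appears because `\s*` can also match a newline, so the look-ahead succeeds once at the start of the empty line and again at the start of the `field` line. A minimal sketch that avoids the extra `grep`, assuming Perl 5.10 or later for `\h` (horizontal whitespace only) and a shortened copy of the sample data:

```perl
use strict;
use warnings;
use v5.10;

# Shortened copy of the sample data from the question.
my $string = <<'END';
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";
END

# \h never matches a newline, so the look-ahead cannot succeed at the
# start of a blank line the way \s* can.
my @blocks = split /(?=^\h*field\h*=)/m, $string;

say "\$_=[\n$_]" for @blocks;
```

Each chunk still carries the blank line that follows it, so trim trailing whitespace if that matters.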

Axeman answered Oct 14 '22 04:10


For overall "whipupitude" regarding your sample data I think passing a pattern to split is going to be the easiest. But, as @Schwern points out, when things get more complex using a grammar helps.

For fun I created an example script that parses your data using a parsing expression grammar built with Pegex. Both Regexp::Grammars and Regexp::Common have the advantage of widespread use and familiarity when it comes to quickly constructing a grammar. There's a low barrier to entry if you already know Perl and need a simple but super-powered version of regular expressions for your project. The Pegex approach is to try to make it easy to construct and use grammars with Perl. With Pegex you build a parsing expression grammar out of regular expressions:

"Pegex... gets its name by combining Parsing Expression Grammars (PEG), with Regular Expressions (Regex). That's actually what Pegex does." (from the POD).

Below is a standalone script that parses a simplified version of your data using a Pegex grammar.


First the script reads out $grammar "inline" as a multi-line string and uses it to ->parse() the sample data which it reads from the <DATA> handle. Normally the parsing grammar and data would reside in separate files. The grammar's "atoms" and regular expressions are compiled using the pegex function into a "tree" or hash of regular expressions that is used to parse the data. The parse() method returns a data structure that can be used by perl. Adding use DDP and p $ast to the script can help you see what structures (AoH, HoH, etc.) are being returned by your grammar.

#!/usr/bin/env perl
use v5.22;
use experimental qw/ refaliasing postderef / ;
use Pegex;

my $data = do { local $/; <DATA> } ;

my $grammar = q[
%grammar thing
%version 0.0.1

things: +thing*
thing: (+field +type +text)+ % end 

value: / <DOUBLE> (<ANY>*) <DOUBLE> /
equals: / <SPACE> <EQUAL>  <SPACE> /
end: / BLANK* EOL / 

field: 'field' <equals> <value> <SEMI> <EOL>
type:  'type' <equals> /\b(INT|FLOAT)\b/ <SEMI> <EOL>
func:  / ('funcCall' LPAREN <ANY>* RPAREN ) / <SEMI> <EOL> .( <DOT>3 <EOL>)*
text:  'text' <equals> <value> <SEMI> <EOL>    
];

my $ast = pegex($grammar, 'Pegex::Tree')->parse($data);

for \my @things ( $ast->[0]->{thing}->@* ) {
  for \my %thing ( @things ) { 
    say $thing{"text"}[0] if $thing{"text"}[0] ; 
    say $thing{"func"}[0] if $thing{"func"}[0] ; 
  }
}

At the very end of the script a __DATA__ section holds the content of the file to parse:

__DATA__
field = "test string 0";
type = INT;
funcCall(.., field, ...);
...
text = "desc 1";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";    

You could of course just as easily read the data from a file handle or STDIN in the classic perl fashion or, for example, using IO::All where we could do:

use IO::All; 
my $infile < io shift ; # slurp the file named by the first command-line argument

You can install Pegex from CPAN and then download and play with the gist to get a feel for how Pegex works.

With Perl6 we are getting a powerful and easy to use "grammar engine" that builds on Perl's strengths in handling regular expressions. If grammars start to get used in a wider range of projects these developments are bound to feed back into perl5 and lead to even more powerful features.

The PEG part of Pegex and its cross-language development allow grammars to be exchanged between different programming language communities (Ruby, Javascript). Pegex can be used in fairly simple scenarios, and fits nicely into more complex modules that require parsing capabilities. The Pegex API allows for easy creation of a rule-derived set of functions that can be defined in a "receiver class". With a receiver class you can build sophisticated methods for working with your parsed data that allow you to "munge while you parse", and even modify the grammar on the fly (!). More examples of working grammars that can be re-purposed and improved, and a growing selection of modules that use Pegex, will help it become more useful and powerful.

Perhaps the simplest approach to trying out the Pegex framework is Pegex::Regex - which allows you to use grammars as conveniently as regexps, storing the results of your parse in %/. The author of Pegex calls Pegex::Regex the "gateway drug" to parsing expression grammars and notes it is "a clone of Damian Conway's Regexp::Grammars module API" (covered by @Schwern in his answer to this question).

It's easy to get hooked.

G. Cito answered Oct 14 '22 03:10


The quick and dirty way is to define a regex which mostly matches the field assignment, then use that in another regex to match what's between them.

my $field_assignment_re = qr{^\s* field \s* = \s* [^;]+ ;}msx;

$code =~ /$field_assignment_re (.*?) $field_assignment_re/msx;
print $1;

The downside of this approach is it might match quoted strings and the like.
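To collect every block rather than only the text between the first two assignments, a look-ahead keeps each match from consuming the next "field =", which is what caused the "every other instance" result in the question. A minimal sketch in core Perl, assuming each block starts with `field` at the beginning of a line; it shares the same downside regarding quoted strings:

```perl
use strict;
use warnings;
use v5.10;

# Sample data from the question.
my $code = <<'END';
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";
END

# With /g in list context and no capture groups, each element is one
# whole match, i.e. one block.
my @blocks = $code =~ m{
    ^\h* field \s* = .*?          # a block starts at a "field =" line
    (?= ^\h* field \s* = | \z )   # and stops before the next one or at the end
}gmsx;

say "[$_]" for @blocks;
```

The `field` inside `funcCall(.., field, ...)` is skipped because it is neither at the start of a line nor followed by `=`.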


You can sort of parse code with regular expressions, but parsing it correctly is beyond normal regular expressions. This is because of the large number of balanced delimiters (i.e. parens and braces) and escapes (i.e. "<foo \"bar\"">"). To get it right you need to write a grammar.

Perl 5.10 added recursive descent matching to make writing grammars possible. It also added named capture groups to keep track of all those rules. Now you can write a recursive grammar with Perl 5.10 regexes.
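As a small illustration of those two features in core Perl, here is a sketch that matches balanced parentheses using a named group that recurses into itself. The group name `parens` and the sample text are made up for the example, and it deliberately ignores quoted strings and escapes:

```perl
use strict;
use warnings;
use v5.10;

# The named group "parens" matches one balanced (...) group and
# recurses into itself via (?&parens) for each nested group.
my $balanced = qr{
    (?<parens>
        \(
        (?: [^()]++ | (?&parens) )*
        \)
    )
}x;

# Hypothetical input with nested calls.
my $text = 'funcCall(a, inner(b, (c)), d); other(x)';

my @calls;
while ( $text =~ /(\w+) $balanced/gx ) {
    # $1 is the name in front of the parens; $+{parens} is the
    # outermost balanced group that followed it.
    push @calls, $1 . $+{parens};
}

say for @calls;
```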

It's still kinda clunky; Regexp::Grammars adds some enhancements to make writing regex grammars much easier.

Writing a grammar is about starting at some point and filling in the rules. Your program is a bunch of Statements. What's a Statement? An Assignment, or a FunctionCall followed by a ;. What's an Assignment? Variable = Expression. What is Variable and Expression? And so on...

use strict;
use warnings;
use v5.10;

use Regexp::Grammars;

my $parser = qr{
  <[Statement]>*

  <rule: Variable>      \w+
  <rule: FunctionName>  \w+
  <rule: Escape>        \\ .
  <rule: Unknown>       .+?
  <rule: String>        \" (?: <Escape> | [^\"] )* \"
  <rule: Ignore>        \.\.\.?
  <rule: Expression>    <Variable> | <String> | <Ignore>
  <rule: Assignment>    <Variable> = <Expression>
  <rule: Statement>     (?: <Assignment> | <FunctionCall> | <Unknown> ); | <Ignore>
  <rule: FunctionArguments>     <[Expression]> (?: , <[Expression]> )*
  <rule: FunctionCall>  <FunctionName> \( <FunctionArguments>? \)
}x;

my $code = <<'END';
field = "test \" string";
alkjflkj;
type =  INT;
funcCall(.., field, "escaped paren \)", ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";
END

$code =~ $parser;

This is far more robust than a regex. The inclusion of:

<rule: Escape>        \\ .
<rule: String>        \" (?: <Escape> | [^\"] )* \"

Handles otherwise tricky edge cases like:

funcCall( "\"escaped paren \)\"" );

It all winds up in %/. Here's the first part.

$VAR1 = {
          'Statement' => [
                           {
                             'Assignment' => {
                                               'Variable' => 'field',
                                               'Expression' => {
                                                                 'String' => '"test string"',
                                                                 '' => '"test string"'
                                                               },
                                               '' => 'field = "test string"'
                                             },
                             '' => 'field = "test string";'
                           },
          ...

Then you can loop through the Statement array looking for Assignments where the Variable matches field.

my $seen_field_assignment = 0;
for my $statement (@{$/{Statement}}) {
    # Check if we saw 'field = ...'
    my $variable = ($statement->{Assignment}{Variable} || '');
    $seen_field_assignment++ if $variable eq 'field';

    # Bail out if we saw the second field assignment
    last if $seen_field_assignment > 1;

    # Print if we saw a field assignment
    print $statement->{''} if $seen_field_assignment;
}

This might seem like a lot of work, but it's worth learning how to write grammars. There are a lot of problems which can be half-solved with regexes, but fully solved with a simple grammar. In the long run, the regex will get more and more complicated and never quite cover all the edge cases, while a grammar is easier to understand and can be made perfect.

The downside of this approach is your grammar might not be complete and it might trip up, though the Unknown rule will take care of most of that.

Schwern answered Oct 14 '22 03:10