
Perl - Regex to match double quoted text

I need some help with regex matching, please. I'm trying to match a double-quoted string of text within a larger string, where the quoted text can itself contain pairs of double quotes. Here's an example:

"Please can ""you"" match this"

A fuller example of my problem, and how far I've got, is shown below. The code stores only 'paris' correctly in the hash; both london and Melbourne come out wrong because the first pair of double quotes terminates the long description early.

Any help much appreciated.

use strict;
use warnings;
use Data::Dumper;

my %hash;

my $delimiter = '/begin CITY';
local $/ = $delimiter;

my $top_of_file = <DATA>;
my $records=0;

while(<DATA>) {

   my ($section_body) = m{^(.+)/end CITY}ms;

   $section_body =~ s{/\*.*?\*/}{}gs;     # Remove any comments in string

   $section_body =~ m{  ^\s+(.+?)   ## Variable name is never whitespace separated
                                    ## Always underscored.  Akin to C variable names

                        \s+(".*?")  ## The long description can itself contain
                                    ## pairs of double quotes ""like this""

                        \s+(.+)     ## Everything from here can be split on
                                    ## whitespace

                        \s+$
                     }msx;

   $hash{$records}{name} = $1;
   $hash{$records}{description} = $2;

   my (@data) = split ' ', $3;

   @{ $hash{$records} }{qw/ size currency /} = @data;

   ++$records;
}

print Dumper(\%hash);


__DATA__
Some header information

/begin CITY

    london  /* city name */
    "This is a ""difficult"" string to regex"
    big
    Sterling

/end CITY

/begin CITY paris
         "This is a simple comment to grab."
         big
         euro  /* the address */
/end CITY


/begin CITY

    Melbourne
    "Another ""hard"" long description to 'match'."
    big
    Dollar

/end CITY
asked Dec 28 '25 16:12 by Chris


2 Answers

Change this:

".*?"

to this:

"(?>(?:[^"]+|"")*)"
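
To see what that replacement pattern does, here is a minimal sketch run against the london description from the question. The variable names ($quoted, $text, $description) are made up for illustration, and the /x spacing is only there for readability:

```perl
use strict;
use warnings;

# A quoted string in which a doubled quote ("") stands for a literal quote.
# Inside the quotes, take either a run of non-quote characters or a "" pair;
# the atomic group (?>...) stops the engine from giving characters back.
my $quoted = qr{ " (?> (?: [^"]+ | "" )* ) " }x;

my $text = 'london "This is a ""difficult"" string to regex" big Sterling';

my ($description) = $text =~ m{ ($quoted) }x;
print "$description\n";
```

The match runs to the final closing quote because a lone " is only accepted once neither alternative inside the group can continue.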

Also, your use of non-greedy matching isn't very safe. Something like this:

\s+(.+?)   ## Variable name is never whitespace separated
           ## Always underscored.  Akin to C variable names

could well end up including whitespace inside the variable name if that is the only way Perl can make the whole pattern match. (A non-greedy quantifier prefers to stop before any whitespace, but it will still expand across it when the rest of the pattern would otherwise fail.)
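
One way to rule that out, assuming names really never contain whitespace as the comment says, is to match the name with \S+ instead of .+?. This goes a step beyond the pattern in the question, and the record below is invented purely to show the difference:

```perl
use strict;
use warnings;

# Hypothetical malformed record: the "name" contains a space.
my $record = "  new york\n  \"A description\"\n";

# Non-greedy .+? quietly stretches across the space to let the match succeed.
my ($lazy) = $record =~ m{ \A \s+ (.+?) \s+ " }xs;

# \S+ cannot cross whitespace, so the malformed record fails to match.
my ($strict) = $record =~ m{ \A \s+ (\S+) \s+ " }xs;

print "lazy:   ", (defined $lazy   ? "[$lazy]"   : "no match"), "\n";
print "strict: ", (defined $strict ? "[$strict]" : "no match"), "\n";
```

With the stricter class, a bad record shows up as a failed match instead of as a silently wrong name.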

You should also always check that m{} actually matched, because after a failed match $1, $2 and $3 keep whatever an earlier successful match left in them. If you're sure that it will always match, then you can just tack on an or die to validate that.
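
A rough sketch of that check, combining it with the doubled-quote pattern. The helper name parse_record and the error message are hypothetical, and the pattern is a simplified variant of the one in the question, not a drop-in replacement:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, description, remaining fields) for one
# record body, or dies if the body does not have the expected shape.
sub parse_record {
    my ($body) = @_;

    # Capture in list context so stale $1/$2/$3 values are never read.
    my ($name, $description, $rest) = $body =~ m{
        ^ \s* (\S+)                          # name: no whitespace allowed
        \s+ ( " (?> (?: [^"]+ | "" )* ) " )  # description, "" = literal quote
        \s+ (.+?)                            # remaining fields
        \s* \z
    }xs
        or die "Unparseable CITY record: [$body]\n";

    return ($name, $description, split ' ', $rest);
}

my @fields = parse_record(
    "\n    london\n    \"This is a \"\"difficult\"\" string to regex\"\n    big\n    Sterling\n\n"
);
print join(' | ', @fields), "\n";
```

Because or binds more loosely than the assignment, the die fires only when the list assignment receives nothing, that is, when the match failed.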

answered Dec 30 '25 12:12 by ruakh


I don't know how much luck you are going to have parsing quoted text with your own regexes; it can be a pretty dicey business. I would look at a module like Text::Balanced.

https://metacpan.org/pod/Text::Balanced

That ought to do what you need it to, and a good bit less painfully.

I know I'm supposed to answer the question as asked, but regexes are really not the way you want to do this.
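
For what it's worth, a minimal sketch of how that might look with extract_delimited from Text::Balanced. It assumes that passing '"' as the escape character makes the module treat a doubled quote as an escaped quote; check that against the module's documentation before relying on it. The sample text is adapted from the question:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text = '   "This is a ""difficult"" string to regex" big Sterling';

# Arguments: text, delimiter, prefix pattern (undef keeps the default of
# optional leading whitespace), escape character. Using '"' as the escape
# character is the assumption that lets "" count as a literal quote.
my ($extracted, $remainder) = extract_delimited($text, '"', undef, '"');

print "description: $extracted\n";
print "rest:        $remainder\n";
```

In list context the function returns the extracted substring followed by the remainder of the text, so the trailing fields can then be split on whitespace as before.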

answered Dec 30 '25 12:12 by Sean O'Leary