How can Perl split a line on whitespace except when the whitespace is in double quotes?

Tags: regex, split, perl

I have the following string:

StartProgram    1   ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout    1   30

I need a regular expression in Perl that splits this line on whitespace but ignores whitespace inside double quotes.

This is what I tried, but it does not work:

(".*?"|\S+)
Avinash asked Oct 14 '09 12:10



2 Answers

Once upon a time I also tried to re-invent the wheel, and solve this myself.

Now I just use Text::ParseWords and let it do the job for me.
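A minimal sketch of what that could look like, using quotewords from the core Text::ParseWords module. Two things here are assumptions rather than part of the answer: the helper name split_quoted is hypothetical, and the doubled quotation marks in the question's input are first collapsed to single ones, because the module does not treat "" as a delimiter by itself. The keep flag is set to true so that the backslashes in the Windows path are left alone; the surrounding quotes are then stripped by hand.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper: split $line on whitespace, keeping quoted text together.
sub split_quoted {
    my ($line) = @_;

    # Assumption: every "" in the input stands for one plain double quote.
    ( my $normalized = $line ) =~ s/""/"/g;

    # keep = 1 leaves quotes and backslashes untouched in the returned fields.
    my @fields = quotewords( '\s+', 1, $normalized );

    # Remove the quotes around fields that are quoted as a whole.
    s/\A"(.*)"\z/$1/s for @fields;

    return @fields;
}

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
print "[$_]\n" for split_quoted($s);
```

With keep set to false, quotewords would also remove the backslashes, which is usually not what you want for a path like this one.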

Colin Fine answered Oct 12 '22 23:10


Update: It looks like the fields are actually tab separated, not space. If that is guaranteed, just split on \t.
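If the tab assumption holds, the split is a one-liner. The sketch below builds the record with explicit tabs (an assumed reconstruction of the input, since the tabs are not visible in the question) and optionally strips the doubled quotes afterwards.

```perl
use strict;
use warnings;

# Assumed input: the record from the question with one tab between fields.
my $line = join "\t",
    'StartProgram', '1', '""C:\Program Files\ABC\ABC XYZ""',
    'CleanProgramTimeout', '1', '30';

# Spaces inside the path are irrelevant because only tabs separate fields.
my @fields = split /\t/, $line;

# Optional: drop the doubled quotes wrapping a field.
s/\A""(.*)""\z/$1/s for @fields;

print "[$_]\n" for @fields;
```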

First, let's see why (".*?"|\S+) "does not work". Specifically, look at the ".*?" alternative: it matches zero or more characters enclosed in double quotes. The field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". The "" at the beginning of that field matches ".*?" on its own, because "" is zero characters surrounded by double quotes. The rest of the field is then matched piece by piece by \S+, which breaks it at each space.
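To see the effect, here is a short sketch that applies the pattern from the question globally and prints each token it produces. It assumes the space-separated form of the input line.

```perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# Collect every match of the pattern from the question.
my @tokens = $s =~ /(".*?"|\S+)/g;

# The path ends up in several tokens: the leading "" is matched on its own
# and the remainder is cut at every space.
print scalar(@tokens), " tokens\n";
print "[$_]\n" for @tokens;
```

This yields nine tokens instead of six, with the path spread over four of them.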

It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives in a fixed format, write a regular expression that mirrors that format as closely as possible.

Move the quotation marks outside of the capturing parentheses if you don't want them.

#!/usr/bin/perl

use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

my @parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};

use Data::Dumper;
print Dumper \@parts;

Output:

$VAR1 = [
          'StartProgram',
          '1',
          '""C:\\Program Files\\ABC\\ABC XYZ""',
          'CleanProgramTimeout',
          '1',
          '30'
        ];
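For completeness, this is a sketch of the variant with the quotation marks moved outside the capturing parentheses, so that only the path itself is captured. Everything else is the same as in the script above.

```perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# The doubled quotes are matched but sit outside the third capture group.
my @parts = $s =~ m{\A(\w+) ([0-9]) ""([^"]+)"" (\w+) ([0-9]) ([0-9]{2})};

print "$parts[2]\n";
```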

In that vein, here is a more involved script:

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

my @strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT

my $re = qr{
    (?<directive>StartProgram)\s+
    (?<instance>[0-9][0-9]?)\s+
    (?<path>"".+?""|\S+)\s+
    (?<timeout_directive>CleanProgramTimeout)\s+
    (?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;

for (@strings) {
    if ( $_ =~ $re ) {
        print Dumper \%+;
    }
}

Output:

$VAR1 = {
          'timeout_directive' => 'CleanProgramTimeout',
          'timeout_seconds' => '30',
          'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
          'directive' => 'StartProgram',
          'timeout_instance' => '1',
          'instance' => '1'
        };
$VAR1 = {
          'timeout_directive' => 'CleanProgramTimeout',
          'timeout_seconds' => '30',
          'path' => 'c:\\opt\\perl',
          'directive' => 'StartProgram',
          'timeout_instance' => '1',
          'instance' => '1'
        };

Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the repeated quotation marks that delineate the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selective re-gathering of parts of the string.

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};

print Dumper parse_line($s);
print Dumper parse_line($t);

sub parse_line {
    my ($line) = @_;
    my @parts = split /(\s+)/, $line;
    my @real_parts;

    for (my $i = 0; $i < @parts; $i += 1) {
        unless ( $parts[$i] =~ /^""/ ) {
            push @real_parts, $parts[$i] if $parts[$i] =~ /\S/;
            next;
        }
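        my $part = '';
        # Re-join the split pieces (including the whitespace between them)
        # until the closing "" is seen; stop at the end of the input if it
        # never appears, so an unterminated field cannot loop forever.
        do {
            $part .= $parts[$i++];
        } until ( $part =~ /""$/ or $i >= @parts );
        push @real_parts, $part;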
    }
    return \@real_parts;
}
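As a possible alternative to re-gathering split pieces, the fields can be pulled out with a single global match that tries a ""...""-delimited run before falling back to \S+. This is only a sketch: the helper name tokenize is hypothetical, and it assumes that a quoted field never contains a double quote of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: return the fields of $line, treating text wrapped in
# doubled quotes as a single field (quotes are kept in the result).
sub tokenize {
    my ($line) = @_;
    my @fields = $line =~ /(""[^"]+""|\S+)/g;
    return @fields;
}

for my $line (
    q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30},
    q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30},
) {
    print join( '|', tokenize($line) ), "\n";
}
```

Because the quoted alternative comes first, the regex engine tries it at each position before \S+, which is what keeps the path together.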
Sinan Ünür answered Oct 12 '22 23:10