I have the following string:
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
I need a regular expression to split this line but ignore spaces in double quotes in Perl.
The following is what I tried but it does not work.
(".*?"|\S+)
How can we split a string in Perl on whitespace? The simplest way is to use the split() function, supplying a regular expression that matches whitespace as the first argument.
split() is a built-in Perl function that cuts a string into smaller pieces. The separator can be a single character, a group of characters, or a regular expression pattern; if the pattern is omitted, split() splits on whitespace.
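As a quick illustration of plain whitespace splitting, here is a minimal sketch using the line from the question (the variable names are only illustrative). It shows the problem being asked about: the quoted path is cut into three pieces, giving eight fields instead of six.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# split ' ' is the special whitespace mode: it skips leading whitespace
# and splits on runs of whitespace. It knows nothing about quotes.
my @fields = split ' ', $line;

print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```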
Once upon a time I also tried to re-invent the wheel, and solve this myself.
Now I just use Text::ParseWords and let it do the job for me.
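For reference, a minimal sketch of what that can look like with quotewords from Text::ParseWords. Note the assumptions: the input below is a simplified variant of the line with ordinary "..." quoting and forward slashes, not the doubled quotes and backslashes from the question. When the quotes are stripped, the module also treats backslashes as escape characters, so Windows-style paths may need extra care.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Assumption: simplified input with single "..." quoting and forward slashes.
my $line = q{StartProgram 1 "C:/Program Files/ABC/ABC XYZ" CleanProgramTimeout 1 30};

# quotewords(DELIMITER_PATTERN, KEEP, LINES): a false KEEP strips the quotes.
my @fields = quotewords('\s+', 0, $line);

print "[$_]\n" for @fields;
```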
Update: It looks like the fields are actually tab separated, not space separated. If that is guaranteed, just split on \t.
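If the tab assumption holds, the split might look like the sketch below. The sample line is rebuilt here with explicit tab characters, which is an assumption about the real input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the real line uses a single tab between fields.
my $line = join "\t",
    'StartProgram', '1', q{""C:\Program Files\ABC\ABC XYZ""},
    'CleanProgramTimeout', '1', '30';

# Spaces inside the path are untouched because only tabs separate fields.
my @fields = split /\t/, $line;

print "[$_]\n" for @fields;
```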
First, let's see why (".*?"|\S+) "does not work". Specifically, look at ".*?". That means zero or more characters enclosed in double quotes. Well, the field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". Note that each "" at the beginning and end of that field will match ".*?" because "" consists of zero characters surrounded with double quotes.
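To see this concretely, the pattern from the question can be applied with the /g modifier and each match printed. This is just a diagnostic sketch; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# In list context with /g, the match returns every captured token in order.
my @matches = $s =~ /(".*?"|\S+)/g;

print scalar(@matches), " matches\n";
print "[$_]\n" for @matches;
```

The opening "" is matched on its own as an empty quoted string, and the rest of the path then falls through to \S+ one whitespace-delimited chunk at a time, so the line yields nine tokens rather than six.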
It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives and a fixed format, form a regular expression match that is as close to the format you are trying to match as possible.
Move the quotation marks outside of the capturing parentheses if you don't want them.
#!/usr/bin/perl
use strict;
use warnings;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my @parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};
use Data::Dumper;
print Dumper \@parts;
Output:
$VAR1 = [
          'StartProgram',
          '1',
          '""C:\\Program Files\\ABC\\ABC XYZ""',
          'CleanProgramTimeout',
          '1',
          '30'
        ];
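The remark about moving the quotation marks outside the capturing parentheses could be sketched as follows. It is the same match, with only the third group changed so that the doubled quotes are matched but not captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# The "" pairs now sit outside the third capture group, so only the
# path itself is captured.
my @parts = $s =~ m{\A(\w+) ([0-9]) ""([^"]+)"" (\w+) ([0-9]) ([0-9]{2})};

print "$parts[2]\n";
```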
In that vein, here is a more involved script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT
my $re = qr{
    (?<directive>StartProgram)\s+
    (?<instance>[0-9][0-9]?)\s+
    (?<path>"".+?""|\S+)\s+
    (?<timeout_directive>CleanProgramTimeout)\s+
    (?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;
for (@strings) {
    if ( $_ =~ $re ) {
        print Dumper \%+;
    }
}
Output:
$VAR1 = {
          'timeout_directive' => 'CleanProgramTimeout',
          'timeout_seconds' => '30',
          'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
          'directive' => 'StartProgram',
          'timeout_instance' => '1',
          'instance' => '1'
        };
$VAR1 = {
          'timeout_directive' => 'CleanProgramTimeout',
          'timeout_seconds' => '30',
          'path' => 'c:\\opt\\perl',
          'directive' => 'StartProgram',
          'timeout_instance' => '1',
          'instance' => '1'
        };
Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the repeated quotation marks that delineate the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selectively re-gathering parts of the string.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};
print Dumper parse_line($s);
print Dumper parse_line($t);
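sub parse_line {
    my ($line) = @_;
    # Capturing the separator keeps the whitespace runs in @parts, so a
    # quoted field can be reassembled exactly as it was written.
    my @parts = split /(\s+)/, $line;
    my @real_parts;
    for (my $i = 0; $i < @parts; $i += 1) {
        unless ( $parts[$i] =~ /^""/ ) {
            push @real_parts, $parts[$i] if $parts[$i] =~ /\S/;
            next;
        }
        # Re-gather pieces until the closing "" is seen, or until the input
        # runs out, so an unterminated quote cannot loop forever.
        my $part = '';
        do {
            $part .= $parts[$i++];
        } until ( $part =~ /""$/ or $i >= @parts );
        push @real_parts, $part;
    }
    return \@real_parts;
}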
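For comparison, a smaller change to the regular expression from the question may also handle both sample lines: require the doubled quotes explicitly instead of a single pair. This is only a sketch, assuming a quoted field never contains a double quote character itself; tokenize is a hypothetical helper name, not something from the original answers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};

# Hypothetical helper: try a ""...""-delimited field first, then fall
# back to a run of non-whitespace characters.
sub tokenize {
    my ($line) = @_;
    return [ $line =~ /(""[^"]*""|\S+)/g ];
}

my $s_fields = tokenize($s);
my $t_fields = tokenize($t);

print "[$_]\n" for @$s_fields, @$t_fields;
```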