
Perl split list on commas except when within brackets?

Tags: regex, perl

I have a database with a number of fields containing comma-separated values. I need to split these fields in Perl, which is straightforward enough, except that some of the values are followed by nested comma-separated lists in brackets that I do not want to split.

Example:

recycling, environmental science, interdisciplinary (e.g., consumerism, waste management, chemistry, toxicology, government policy, and ethics), consumer education

Splitting on ", " gives me:

recycling
environmental science
interdisciplinary (e.g.
consumerism
waste management
chemistry
toxicology
government policy
and ethics)
consumer education

What I want is:

recycling
environmental science
interdisciplinary (e.g., consumerism, waste management, chemistry, toxicology, government policy, and ethics)
consumer education

Can any Perl regex(perts) lend a hand?

I tried adapting a regex I found in a similar SO post, but it returns no results:

#!/usr/bin/perl

use strict;
use warnings;

my $s = q{recycling, environmental science, interdisciplinary (e.g., consumerism, waste management, chemistry, toxicology, government policy, and ethics), consumer education};

my @parts = $s =~ m{\A(\w+) ([0-9]) (\([^\(]+\)) (\w+) ([0-9]) ([0-9]{2})};

use Data::Dumper;
print Dumper \@parts;

calyeung asked Feb 24 '12 17:02


2 Answers

Try this:

my $s = q{recycling, environmental science, interdisciplinary (e.g., consumerism, waste management, chemistry, toxicology, government policy, and ethics), consumer education};

my @parts = split /(?![^(]+\)), /, $s;
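
For readers wondering why this works: the negative lookahead (?![^(]+\)) rejects any comma from which a closing bracket can be reached without first passing an opening one, which is to say a comma that sits inside a bracketed group. A minimal runnable sketch, assuming brackets are never nested (the string is the one from the question; the pattern is expected to split inside the outer group if brackets are nested):

```perl
use strict;
use warnings;

my $s = q{recycling, environmental science, interdisciplinary (e.g., consumerism, waste management, chemistry, toxicology, government policy, and ethics), consumer education};

# Split on ", " unless a ")" can be reached from the comma without
# crossing a "(" first, which means the comma is inside brackets.
# Assumption: brackets are balanced and not nested.
my @parts = split /(?![^(]+\)), /, $s;

# One field per line; the bracketed list stays in a single field.
print "$_\n" for @parts;
```

With the sample string this should yield four fields, the third being the whole "interdisciplinary (...)" entry.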

raina77ow answered Nov 23 '22 11:11


The solution you have chosen is superior, but for those who would say otherwise: Perl regular expressions (5.10 and later) have a recursion construct, (?PARNO), which can match nested parentheses. The following works fine

use strict;
use warnings;

my $s = q{recycling, environmental science, interdisciplinary (e.g., consumerism, waste management, chemistry, toxicology, government policy, and ethics), consumer education};

my @parts;

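# Capture one field per match: runs of plain text and balanced (...) groups.
# The outer group uses + rather than * so that a zero-length match at the
# end of the string does not add an empty trailing field.
push @parts, $1 while $s =~ /
((?:
  [^(),]+ |                  # text outside brackets, up to a comma
  ( \(
    (?: [^()]+ | (?2) )*     # group 2 recurses into itself for nesting
  \) )
)+)
(?: ,\s* | $)                # a field ends at a comma or at end of string
/xg;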


print "$_\n" for @parts;

even if the parentheses are nested further. No, it's not pretty, but it does work!
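
If the recursive pattern is hard to maintain, the same result can be had without regex recursion by scanning the string and counting bracket depth. This is a different technique from the one above, not a rewrite of it. A minimal sketch, where split_top_level is a hypothetical helper name and the nested sample string is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split on top-level commas only, tracking bracket depth.
sub split_top_level {
    my ($text) = @_;
    my @fields;
    my $depth   = 0;
    my $current = '';

    for my $char (split //, $text) {
        if ($char eq '(') {
            $depth++;
        }
        elsif ($char eq ')') {
            $depth-- if $depth > 0;   # tolerate a stray ")"
        }

        if ($char eq ',' && $depth == 0) {
            push @fields, $current;   # a top-level comma ends the field
            $current = '';
        }
        else {
            $current .= $char;        # everything else belongs to the field
        }
    }
    push @fields, $current;           # the last field has no trailing comma

    # Trim the whitespace left around each field.
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

# Illustrative input with two levels of nesting (not from the question).
my $s = q{recycling, interdisciplinary (e.g., ethics (applied, normative), policy), consumer education};
my @parts = split_top_level($s);
print "$_\n" for @parts;
```

The trade-off is more lines of code in exchange for logic that is easy to step through, and unbalanced input degrades predictably instead of causing the regex engine to backtrack.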

Borodin answered Nov 23 '22 09:11