
Match all commas that are outside parentheses and square brackets in Perl regex

Tags:

regex

perl

I'm trying to use a regex to match every comma followed by a space (", ") that is outside any parentheses or square brackets, i.e. the comma must not be contained in parentheses or square brackets.

The target string is A, An(hi, world[hello, (hi , world) world]); This, These. In this case, it should match the first comma and the last comma (the ones between A and An, and between This and These).

That way I could split A, An(hi, world[hello, (hi , world) world]); This, These into three parts: "A", "An(hi, world[hello, (hi , world) world]); This", and "These", without leaving parens/brackets unbalanced in any part.

To that end, it seems hard to use regex alone. Is there any other approach to this problem?

The regex expression I'm using: , (?![^()\[\]]*[\)\]])

But this expression also matches two extra commas (the second and the third), which should not be matched.

It does work on simpler strings such as A, An(hi, world) and A, An[hi, world], where it matches only the correct comma (the first one in each).

But when parentheses and brackets are nested inside each other, it breaks.

More details in this link: https://regex101.com/r/g8DOh6/1

jonah_w asked Oct 29 '25 06:10


2 Answers

The problem here is identifying "balanced" pairs, of parentheses/brackets in this case. This is a well-known problem, for which there are libraries. They can find the top-level matching pairs, (...)/[...] with everything inside them, and everything else outside them; then we process the "else."

One way, using Regexp::Common

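use warnings;
use strict;
use feature 'say';

use Regexp::Common;

# Take the string from the command line, or fall back to a sample
my $str = shift // q{A, t(a,b(c,))u B, C, p(d,)q D,};

# Split on balanced (...) / [...] groups; the groups themselves are returned too
my @all_parts = split /$RE{balanced}{-parens=>'()[]'}/, $str;

# Drop the parts that are balanced groups
my @no_paren_parts = grep { not /\(.*\) | \[.*\]/x } @all_parts;

say for @no_paren_parts;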

This relies on split returning the separators as well when the separator pattern contains capturing groups. The library regex captures, so we get everything back: the parts obtained by splitting the string at what the regex matched, and also the parts the regex matched. The separators contain the paired delimiters while the other parts cannot, by construction, so I filter on that. This prints

A, t
u B, C, p
q D,

The paren/bracket terms are gone, but how the string is split is otherwise a bit arbitrary.

The above is somewhat "generic," using the library merely to extract the balanced pairs ()/[], along with all other parts of the string. Or, we can remove those patterns from the string

$str =~ s/$RE{balanced}{-parens=>'()[]'}//g;

which leaves

A, tu B, C, pq D,

Now one can simply split by commas

my @terms = split /\s*,\s*/, $str;
say for @terms;

which prints

A
tu B
C
pq D

This is the desired result in this case, as clarified in comments.

Another notable library, in many ways more fundamental, is the core module Text::Balanced. See Shawn's answer here, and this post, this one, and this one for examples.


An example. With

my $str = q(it, is; surely);

my @terms = split /[,;]/, $str;

one gets only the fields ("it", " is", " surely") in the array @terms, while with

my @terms = split /([,;])/, $str;

we get the separators in @terms as well: "it", ",", " is", ";", " surely"


Also by construction, the list holds what the regex matched at odd indices. So for all other parts we can fetch the elements at even indices

my @other_than_matched_parts = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];

zdim answered Oct 31 '25 07:10


Checking whether a comma , is within brackets/parentheses, e.g.

[(,),],[abc,(def,[ghi,],),],[(,),]
      ^                    ^

means that the pattern must know exactly when each of those brackets/parentheses was opened and closed in a balanced way: not just e.g. [([] when it should be [([])].
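
To make the "aware of what was opened and closed" requirement concrete, here is a minimal sketch of a plain scanner that keeps a stack of the closing delimiters it still expects and splits only when that stack is empty. It is not a regex solution and not part of the pattern discussed in this answer; the helper name split_outside_brackets is made up for illustration, and the sketch assumes the separator is a comma followed by one space, as in the question. Stray or mismatched closers are simply treated as ordinary text.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: split $str on ", " found outside any (...) or [...].
sub split_outside_brackets {
    my ($str) = @_;
    my %closer_for = ( '(' => ')', '[' => ']' );
    my @expected;          # stack of closing delimiters still awaited
    my @parts = ('');
    my $i = 0;
    while ($i < length $str) {
        my $ch = substr $str, $i, 1;
        if (exists $closer_for{$ch}) {
            push @expected, $closer_for{$ch};      # a group opens
        }
        elsif (@expected and $ch eq $expected[-1]) {
            pop @expected;                         # innermost group closes
        }
        elsif (!@expected and substr($str, $i, 2) eq ', ') {
            push @parts, '';                       # top-level separator
            $i += 2;
            next;
        }
        $parts[-1] .= $ch;                         # ordinary text
        $i++;
    }
    return @parts;
}

say for split_outside_brackets('A, An(hi, world[hello, (hi , world) world]); This, These');
```

For the string from the question this prints three lines: A, then An(hi, world[hello, (hi , world) world]); This, then These.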

Here is an alternative solution that doesn't solve your problem directly but might be a step closer.

  1. Match either of the following:

    a. Comma

    b. A group enclosed in an outer [] or (). See Regular expression to match balanced parentheses

  2. Filter out 1.b

Regex pattern:

(?:\((?>[^()]|(?R))*\)|\[(?>[^\[\]]|(?R))*\]|,)


For this string, the matches are as pointed out:

A, An(hi, world[hello, (hi , world) world]) and this, is that, for [the, one (in, here, [is not,])] and last,here!
 ^   ^------------------------------------^         ^        ^     ^------------------------------^         ^
  • So it didn't capture any commas inside those bracket/parenthesis groups, because each group was captured as a whole. What remains are the commas at the outer level.
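
The two steps above (match a comma or a whole group, then filter out the groups) could be sketched in Perl roughly as follows. This is an illustration under a few assumptions, not the exact pattern shown above: it replaces (?R) with a named group so the comma alternative stays outside the recursion, it excludes both kinds of delimiters from the "ordinary character" class so that () and [] nest inside each other, and it lets the comma swallow trailing whitespace. The names grp, comma, and split_top_level are chosen for this sketch only. It also assumes the delimiters are balanced; an unmatched opener is not recognized as a group, so commas after it count as outer commas.

```perl
use strict;
use warnings;
use feature 'say';

# Step 1: match either a balanced (...) / [...] group or a comma.
# "grp" and "comma" are names chosen for this sketch.
my $token = qr/
    (?<grp>
        \( (?: [^()\[\]]++ | (?&grp) )* \)
      | \[ (?: [^()\[\]]++ | (?&grp) )* \]
    )
  | (?<comma> , \s* )
/x;

# Hypothetical helper: split $str at the commas that are not inside a group.
sub split_top_level {
    my ($str) = @_;
    my @parts;
    my $last = 0;                       # offset where the current part starts
    while ($str =~ /$token/g) {
        next unless defined $+{comma};  # Step 2: skip matches that are groups
        push @parts, substr($str, $last, $-[0] - $last);
        $last = $+[0];                  # continue after the comma
    }
    push @parts, substr($str, $last);   # text after the last outer comma
    return @parts;
}

say for split_top_level('A, An(hi, world[hello, (hi , world) world]); This, These');
```

For the string from the question this prints A, then An(hi, world[hello, (hi , world) world]); This, then These, each on its own line.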

Niel Godfrey Ponciano answered Oct 31 '25 08:10