Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle nested parentheses with regex?

Tags:

python

regex

I came up with a regex string that parses the given text into 3 categories:

  • in parentheses
  • in brackets
  • neither.

Like this:

\[.+?\]|\(.+?\)|[\w+ ?]+

My intention is to use the outermost operator only. So, given a(b[c]d)e, the split is going to be:

a || (b[c]d) || e

It works fine given parentheses inside brackets, or brackets inside parentheses, but breaks down when there are brackets inside brackets and parentheses inside parentheses. For example, a[b[c]d]e is split as

a || [b[c] || d || ] || e.

Is there any way to handle this using regex alone, not resorting to using code to count number of open/closed parentheses? Thanks!

like image 241
Minas Abovyan Avatar asked Jun 29 '13 20:06

Minas Abovyan


2 Answers

Standard1 regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.


1 I said standard, but not all regular expression engines are indeed standard. You might be able to this with Perl, for instance, by using recursive regular expressions. For example:

$str = "[hello [world]] abc [123] [xyz jkl]";

my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

foreach (@matches) {
    print "$_\n";
}
[hello [world]]
abc
[123]
[xyz jkl]

EDIT: I see you're using Python; check out pyparsing.

like image 198
arshajii Avatar answered Oct 22 '22 21:10

arshajii


Well, once you abandon the idea that parsing nested expressions should work at unlimited depth, one can use regular expressions just fine by specifying a maximum depth in advance. Here is how:

def nested_matcher (n):
    # poor man's matched paren scanning, gives up after n+1 levels.
    # Matches any string with balanced parens or brackets inside; add
    # the outer parens yourself if needed.  Nongreedy.  Does not
    # distinguish parens and brackets as that would cause the
    # expression to grow exponentially rather than linearly in size.
    return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n

import re

p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print p.findall('a(b[c]d)e')
print p.findall('a[b[c]d]e')
print p.findall('[hello [world]] abc [123] [xyz jkl]')

This will output

['a', '(b[c]d)', 'e']
['a', '[b[c]d]', 'e']
['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']
like image 1
user3489112 Avatar answered Oct 22 '22 19:10

user3489112