I came up with a regex string that parses the given text into 3 categories:
Like this:
\[.+?\]|\(.+?\)|[\w+ ?]+
My intention is to use the outermost operator only. So, given a(b[c]d)e
, the split is going to be:
a || (b[c]d) || e
It works fine given parentheses inside brackets, or brackets inside parentheses, but breaks down when there are brackets inside brackets and parentheses inside parentheses. For example, a[b[c]d]e
is split as
a || [b[c] || d || ] || e.
Is there any way to handle this using regex alone, not resorting to using code to count number of open/closed parentheses? Thanks!
Standard1 regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.
1 I said standard, but not all regular expression engines are indeed standard. You might be able to this with Perl, for instance, by using recursive regular expressions. For example:
$str = "[hello [world]] abc [123] [xyz jkl]";
my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;
foreach (@matches) {
print "$_\n";
}
[hello [world]] abc [123] [xyz jkl]
EDIT: I see you're using Python; check out pyparsing
.
Well, once you abandon the idea that parsing nested expressions should work at unlimited depth, one can use regular expressions just fine by specifying a maximum depth in advance. Here is how:
def nested_matcher (n):
# poor man's matched paren scanning, gives up after n+1 levels.
# Matches any string with balanced parens or brackets inside; add
# the outer parens yourself if needed. Nongreedy. Does not
# distinguish parens and brackets as that would cause the
# expression to grow exponentially rather than linearly in size.
return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n
import re
p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print p.findall('a(b[c]d)e')
print p.findall('a[b[c]d]e')
print p.findall('[hello [world]] abc [123] [xyz jkl]')
This will output
['a', '(b[c]d)', 'e']
['a', '[b[c]d]', 'e']
['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With