Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to write a recursive regex that matches nested parentheses?

Tags:

regex

php

I am trying to write a regexp which matches nested parentheses, e.g.:

"(((text(text))))(text()()text)(casual(characters(#$%^^&&#^%#@!&**&#^*!@#^**_)))"

A string like this should be matched, cause all the nested parentheses are closed, instead:

"(((text)))(text)(casualChars*#(!&#*(!))"

Should not, or better, should match at least the first "(((text)))(text)" part.

Actually, my regexp is:

 $regex = '/( (  (\() ([^[]*?)  (?R)?  (\))  ){0,}) /x';

But it does not work properly as I am expecting. How to fix that? Where am I wrong? Thanks!

like image 383
tonix Avatar asked Dec 13 '13 14:12

tonix


People also ask

How can you tell a regex that you want it to fit real parentheses and periods?

How would you specify that you want a regex to match actual parentheses and period characters? Periods and parentheses can be escaped with a backslash: \., \(, and \).

Can a regex identify correct bracketing?

Regular expressions can't count brackets.

How do you match square brackets in regex?

[[\]] will match either bracket. In some regex dialects (e.g. grep) you can omit the backslash before the ] if you place it immediately after the [ (because an empty character class would never be useful): [][] .

Can regular expressions be recursive?

Recursion is mostly used to match balanced constructs. The regex \([^()]*+(?:(? R)[^()]*+)*+\) matches a pair of parentheses with all parentheses between them correctly nested, regardless of how many nested pairs there are or how deeply they are nested. This regex satisfies both requirements for valid recursion.


2 Answers

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left  = '{';/* put YOUR left delimiter here.  */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern         = $delimiter_wrap . $delimiter_left
                 . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                 . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );
like image 147
Axel Avatar answered Sep 22 '22 04:09

Axel


This pattern works:

$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';

The content inside parenthesis is simply describe:

"all that is not parenthesis OR recursion (= other parenthesis)" x 0 or more times

If you want to catch all substrings inside parenthesis, you must put this pattern inside a lookahead to obtain all overlapping results:

$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);

Note that I have added a capturing group and I have replaced (?R) by (?1):

(?R) -> refers to the whole pattern (You can write (?0) too)
(?1) -> refers to the first capturing group

What is this lookahead trick?

A subpattern inside a lookahead (or a lookbehind) doesn't match anything, it's only an assertion (a test). Thus, it allows to check the same substring several times.

If you display the whole pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.

Note: the recursive subpattern can be improved like this:

\( [^()]*+ (?: (?R) [^()]* )*+ \)
like image 43
Casimir et Hippolyte Avatar answered Sep 20 '22 04:09

Casimir et Hippolyte