Recursive regex for matching everything in parenthesis (PCRE)

Tags: regex, pcre

I am surprised not to easily find a similar question with an answer on SO. I would like to match everything inside certain function calls. The idea is to remove the useless function wrappers and keep their contents:

foo(some (content)) --> some (content)

So I am trying to match everything inside the function call, which can itself include parentheses. Here is my PCRE regex:

(?<name>\w+)\s*\(\K
(?<e>
     [^()]+
     |
     [^()]*
         \((?&e)\)
     [^()]*
)*
(?=\))

https://regex101.com/r/gfMAIM/1

Unfortunately it doesn't work and I don't really understand why.

nowox asked Oct 12 '25 05:10


2 Answers

Your Group e pattern does not do the right job: the subroutine call (?&e) recurses the group body only once per pair of parentheses, without the outer * quantifier, so inside a nested pair it cannot match several (...) substrings. It needs to match as many (...) substrings as are present, so the subroutine call must sit inside a * or + quantified group. The pattern can even be "simplified" to (?<e>[^()]*(?:\((?&e)\)[^()]*)*).

Note that your Group e pattern is equal to (?<e>[^()]+|\((?&e)\))*. [^()]* around \((?&e)\) are redundant since the [^()]+ alternative will consume the chars other than ( and ) on the current depth level.

Also, you quantified Group e itself, which makes it a repeated capturing group: it only keeps the text matched during the last iteration.
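
The last-iteration behaviour of a repeated capturing group is easy to see on a much smaller pattern. The sketch below uses Python's standard re module rather than PCRE (an assumption made only so the snippet runs without extra libraries); a quantified capturing group behaves the same way in both engines for this case.

```python
import re

# Quantifier outside the group: (\w)*
# Every iteration overwrites the previous capture.
repeated = re.match(r"(\w)*", "abc")

# Quantifier inside the group: (\w*)
# The group captures the whole run in a single iteration.
wrapped = re.match(r"(\w*)", "abc")

print(repeated.group(0), repeated.group(1))  # whole match, then last iteration only
print(wrapped.group(1))
```

This is why the fix moves the repetition inside Group e instead of repeating the group itself.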

You may use

(?<name>\w+)\s*\(\K(?<e>[^()]*(?:\((?&e)\)[^()]*)*)(?=\))

See the regex demo

Details

  • (?<name>\w+)\s*\(\K - one or more word chars, zero or more whitespace chars and a ( char, all dropped from the reported match by \K
  • (?<e> - start of Group e
    • [^()]* - 0+ chars other than ( and )
    • (?: - start of a non-capturing group:
      • \( - a ( char
      • (?&e) - Group e pattern recursed
      • \) - a )
      • [^()]* - 0+ chars other than ( and )
    • )* - 0 or more repetitions
  • ) - end of e group
  • (?=\)) - a ) must be immediately to the right of the current location.
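
For readers without a recursion-capable engine, the same balancing logic can be sketched by hand. The snippet below is not the regex above but a swapped-in technique: a plain depth counter in Python, with a hypothetical helper name call_body. It only inspects the first name( it finds, whereas the regex would retry at later positions.

```python
import re
from typing import Optional

# Hypothetical helper; not part of PCRE or of the answer above.
def call_body(text: str) -> Optional[str]:
    """Return the text between the parentheses of the first `name(` call,
    or None if there is no such call or its parentheses never balance."""
    opener = re.search(r"\w+\s*\(", text)  # same prefix as (?<name>\w+)\s*\(
    if opener is None:
        return None
    start = opener.end()  # first character after the opening "("
    depth = 1             # we are inside one open parenthesis
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:  # this ")" closes the call itself
                return text[start:i]
    return None  # ran out of input while still nested

print(call_body("foo(some (content))"))
```

The depth counter plays the role of the recursion: each ( corresponds to entering (?&e), each ) to returning from it.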
Wiktor Stribiżew answered Oct 14 '25 19:10


The following regex does the matching without taking extra steps:

(?<name>\w+)\s*(\((?<e>([^()]*+|(?2))+)\))

See live demo here

But that doesn't match the following strings, which contain unbalanced parentheses inside a quoted string:

  • foo(bar = ')')
  • foo(bar(john = "(Doe..."))

So what you should look for is:

(?<name>\w+)\s*(\((?<e>([^()'"]*+|"(?>[^"\\]*+|\\.)*"|'(?>[^'\\]*+|\\.)*'|(?2))+)\))

See live demo here

Regex breakdown:

  • (?<name>\w+)\s* Match function name and trailing spaces (named groups are numbered too, so name is group #1)
  • ( Start of capturing group #2
    • \( Match a literal (
    • (?<e> Start of named capturing group e (group #3)
      • ( Start of capturing group #4
        • [^()'"]*+ Match anything except ()'"
        • | Or
        • "(?>[^"\\]*+|\\.)*" Match anything between double quotes
        • | Or
        • '(?>[^'\\]*+|\\.)*' Match anything between single quotes
        • | Or
        • (?2) Recurse capturing group #2, i.e. a whole (...) substring
      • )+ Repeat as much as possible, at least once
    • ) End of named capturing group e
    • \) Match ) literally
  • ) End of capturing group #2
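
The quote handling can also be sketched without a regex engine that supports recursion. The following is a hand-written scanner in Python, a different technique from the pattern above, with a hypothetical helper name call_body_quoted. It assumes, like the pattern, that backslash escapes only matter inside quoted strings.

```python
import re
from typing import Optional

# Hypothetical helper; a depth counter that ignores parentheses inside quotes.
def call_body_quoted(text: str) -> Optional[str]:
    opener = re.search(r"\w+\s*\(", text)  # function name and opening "("
    if opener is None:
        return None
    start = opener.end()
    depth = 1
    quote = None  # the quote character we are currently inside, if any
    i = start
    while i < len(text):
        ch = text[i]
        if quote is not None:
            if ch == "\\":
                i += 1        # skip the escaped character, like \\. in the regex
            elif ch == quote:
                quote = None  # closing quote
        elif ch in "'\"":
            quote = ch        # opening quote
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[start:i]
        i += 1
    return None  # unbalanced parentheses or unterminated string

print(call_body_quoted("foo(bar = ')')"))
print(call_body_quoted('foo(bar(john = "(Doe..."))'))
```

Parentheses seen while quote is set never touch depth, which mirrors the two quoted-string alternatives being tried before (?2).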
revo answered Oct 14 '25 18:10