 

RegEx: text immediately after the last opened parenthesis

Tags: regex, delphi

I know a little about RegEx, but this problem is far beyond my abilities.

I need help finding the text/expression immediately after the last open parenthesis that doesn't have a matching close parenthesis.

It is for the CallTip feature of an open-source program (Object Pascal) that is in development.

Below are some examples:

------------------------------------
Text                  I need
------------------------------------
aaa(xxx               xxx
aaa(xxx,              xxx
aaa(xxx, yyy          xxx
aaa(y=bbb(xxx)        y=bbb(xxx)
aaa(y <- bbb(xxx)     y <- bbb(xxx)
aaa(bbb(ccc(xxx       xxx
aaa(bbb(x), ccc(xxx   xxx
aaa(bbb(x), ccc(x)    bbb(x)
aaa(bbb(x), ccc(x),   bbb(x)
aaa(?, bbb(??         ??
aaa(bbb(x), ccc(x))   ''
aaa(x)                ''
aaa(bbb(              ''
------------------------------------

For all of the text above, the RegEx proposed by @Bohemian
(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(?=[ ,]|$)(?! <-)(?<! <-)
matches every case.

It does not match the cases below, which I found while implementing the RegEx in the software:
------------------------------------
New text              I need
------------------------------------
aaa(bbb(x, y)         bbb(x, y)
aaa(bbb(x, y, z)      bbb(x, y, z)
------------------------------------

Is it possible to write a RegEx (PCRE) for these situations?

In a previous post (RegEx: Word immediately before the last opened parenthesis), Alan Moore (many thanks again) helped me find the text immediately before the last open parenthesis with the RegEx below:

\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)

However, I was not able to adjust it to match the text immediately after.

Can anyone help, please?

jcfaria asked Mar 24 '23 14:03



1 Answer

This is similar to this problem. Since you are using PCRE, there is actually a solution that uses its recursion syntax.

/
(?(DEFINE)                # define a named capture for later convenience
  (?P<parenthesized>      # define the group "parenthesized" which matches a
                          # substring which contains correctly nested
                          # parentheses (it does not have to be enclosed in
                          # parentheses though)
    [^()]*                # match arbitrarily many non-parenthesis characters
    (?:                   # start non capturing group
      [(]                 # match a literal opening (
      (?P>parenthesized)  # recursively call this "parenthesized" subpattern
                          # i.e. make sure that the contents of these literal ()
                          # are also correctly parenthesized
      [)]                 # match a literal closing )
      [^()]*              # match more non-parenthesis characters
    )*                    # repeat
  )                       # end of "parenthesized" pattern
)                         # end of DEFINE sequence

# Now the actual pattern begins

(?<=[(])                  # ensure that there is a literal ( left of the start
                          # of the match
(?P>parenthesized)?       # match correctly parenthesized substring
$                         # ensure that we've reached the end of the input
/x                        # activate free-spacing mode

The gist of this pattern is obviously the parenthesized subpattern. I should maybe elaborate a bit more on that. Its structure is this:

(normal* (?:special normal*)*)

Where normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling-the-loop". It's used to match anything that has the structure

nnnsnnsnnnnsnnsnn

Where n is matched by normal and s is matched by special.
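As a small illustration of the unrolled shape without recursion, here is a sketch in Python (whose built-in re module has no recursion support). It assumes special is a single parenthesized group with no further nesting, so it only recognizes one nesting level. The names UNROLLED and is_flat_balanced are made up for this example.

```python
import re

# normal  = [^()]          any character that is not a parenthesis
# special = \([^()]*\)     one parenthesized group with no further nesting
# Unrolled form: normal* (?: special normal* )*
UNROLLED = re.compile(r"[^()]*(?:\([^()]*\)[^()]*)*")


def is_flat_balanced(text: str) -> bool:
    """Return True if text has balanced parentheses at most one level deep."""
    return UNROLLED.fullmatch(text) is not None


print(is_flat_balanced("y=bbb(xxx), ccc(x)"))  # True
print(is_flat_balanced("bbb(ccc(x))"))         # False: two nesting levels
print(is_flat_balanced("bbb(x"))               # False: unmatched (
```

Replacing the inner [^()]* of special with a recursive call is what turns this into the parenthesized subpattern above.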

In this particular case, things are a bit more complicated though, because we are also using recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is part of). You can view the (?P>...) syntax a bit like a backreference, except that the engine does not try to match what the group ... matched, but instead applies its subpattern again.

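Also note that my pattern will not give you an empty string for correctly parenthesized input, but will fail. You could avoid the failure by leaving out the lookbehind, although a fully balanced input such as aaa(x) then matches in its entirety instead of yielding an empty string. For input with an unmatched parenthesis the lookbehind is actually not necessary, because the engine will always return the left-most match.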

EDIT: Judging by two of your examples, you don't actually want everything after the last unmatched parenthesis, but only everything until the first comma. You can use my result and split on , or try Bohemian's answer.

Further reading:

  • PCRE Subpatterns (including named groups)
  • PCRE Recursion
  • "Unrolling-the-loop" was introduced by Jeffrey Friedl in his book Mastering Regular Expressions, but I think the post I linked above gives a good overview.
  • Using (?(DEFINE)...) is actually abusing another feature called conditional patterns. The PCRE man pages explain how it works - just search the pages for "Defining subpatterns for use by reference only".

EDIT: I noticed that you mention Object Pascal in your question. In that case you are probably not actually using PCRE, which means there is no support for recursion, and without recursion there can be no full regex solution to the problem. If we impose a limitation like "there can only be one more nesting level after the last unmatched parenthesis" (as in all your examples), then we can come up with a solution. Again, I'll use "unrolling-the-loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx.

(?<=[(])         # make sure we start after an opening (
(?=              # lookahead checks that the parenthesis is not matched
  [^()]*([(][^()]*[)][^()]*)*
                 # this matches an arbitrarily long chain of parenthesized
                 # substring, but allows only one nesting level
  $              # make sure we can reach the end of the string like this
)                # end of lookahead
[^(),]*([(][^()]*[)][^(),]*)*
                 # now actually match the desired part. this is the same
                 # as the lookahead, except we do not allow for commas
                 # outside of parentheses now, so that you only get the
                 # first comma-separated part

If you ever add an input example like aaa(xxx(yyy()) where you want to match xxx(yyy()) then this approach will not match it. In fact, no regex that does not use recursion can handle arbitrary nesting levels.

Since your regex flavor doesn't support recursion, you are probably better off without using regex at all. Even if my last regex matches all your current input examples, it's really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudocode gives you everything after the last unmatched (:

while you can read another character from the string
    if that character is "(", push the current position onto the stack
    if that character is ")" and the stack is not empty, pop a position from the stack
# you've reached the end of the string now
if the stack is empty, there is no match
else the top of the stack is the position of the last unmatched parenthesis;
     take a substring from just after that position to the end of the string
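A minimal Python sketch of this stack walk might look like the following. The function name is hypothetical, and two details are my assumptions rather than part of the pseudocode: a stray ")" with nothing to close is ignored, and the returned substring starts one character after the unmatched "(" so that the parenthesis itself is excluded.

```python
from typing import Optional


def after_last_unmatched_paren(text: str) -> Optional[str]:
    """Return everything after the last unmatched (, or None if there is none."""
    open_positions = []  # stack of indices of ( that are still open
    for index, char in enumerate(text):
        if char == "(":
            open_positions.append(index)
        elif char == ")" and open_positions:
            open_positions.pop()  # this ) closes the most recent open (
    if not open_positions:
        return None  # every ( was closed: no match
    return text[open_positions[-1] + 1:]


print(after_last_unmatched_paren("aaa(bbb(x), ccc(xxx"))  # xxx
print(after_last_unmatched_paren("aaa(xxx(yyy())"))       # xxx(yyy())
print(after_last_unmatched_paren("aaa(x)"))               # None
```

Unlike the non-recursive regex, this handles any nesting depth, as the second call shows.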

To then obtain everything up to the first unnested comma, you can walk that result again:

nestingLevel = 0
while you can read another character from the string
    if that character is "," and nestingLevel == 0, stop
    if that character is "(" increment nestingLevel
    if that character is ")" decrement nestingLevel
take a substring from the beginning of the string to the position at which
  you left the loop
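The second walk can be sketched in Python along the same lines. The function name is hypothetical, and I am assuming that an input without any unnested comma should be returned unchanged.

```python
def up_to_first_unnested_comma(text: str) -> str:
    """Return text up to (not including) the first comma outside parentheses."""
    nesting_level = 0
    for index, char in enumerate(text):
        if char == "," and nesting_level == 0:
            return text[:index]  # stop at the first top-level comma
        if char == "(":
            nesting_level += 1
        elif char == ")":
            nesting_level -= 1
    return text  # no top-level comma: keep the whole string


print(up_to_first_unnested_comma("bbb(x), ccc(x),"))  # bbb(x)
print(up_to_first_unnested_comma("bbb(x, y)"))        # bbb(x, y)
print(up_to_first_unnested_comma("xxx, yyy"))         # xxx
```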

These two short loops will be much easier for anyone else to understand in the future and are a lot more flexible than a regex solution (at least one without recursion).

Martin Ender answered Mar 31 '23 15:03
