I have a little knowledge of RegEx, but at the moment this is far beyond my abilities.
I need help finding the text/expression immediately after the last open parenthesis that doesn't have a matching close parenthesis.
It is for the CallTip feature of an open source application (Object Pascal) in development.
Some examples are below:
------------------------------------
Text                     I need
------------------------------------
aaa(xxx                  xxx
aaa(xxx,                 xxx
aaa(xxx, yyy             xxx
aaa(y=bbb(xxx)           y=bbb(xxx)
aaa(y <- bbb(xxx)        y <- bbb(xxx)
aaa(bbb(ccc(xxx          xxx
aaa(bbb(x), ccc(xxx      xxx
aaa(bbb(x), ccc(x)       bbb(x)
aaa(bbb(x), ccc(x),      bbb(x)
aaa(?, bbb(??            ??
aaa(bbb(x), ccc(x))      ''
aaa(x)                   ''
aaa(bbb(                 ''
------------------------------------
For all the text above, the RegEx proposed by @Bohemian
(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(?=[ ,]|$)(?! <-)(?<! <-)
matches all cases.
It does not match the cases below, which I found when implementing the RegEx in the software:
------------------------------------
New text                 I need
------------------------------------
aaa(bbb(x, y)            bbb(x, y)
aaa(bbb(x, y, z)         bbb(x, y, z)
------------------------------------
Is it possible to write a RegEx (PCRE) for these situations?
In a previous post (RegEx: Word immediately before the last opened parenthesis), Alan Moore (many thanks again) helped me find the text immediately before the last open parenthesis with the RegEx below:
\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)
However, I was not able to adapt it to match the text immediately after.
Can anyone help, please?
This is similar to this problem. Since you are using PCRE, there is actually a solution that uses its recursion syntax.
/
(?(DEFINE) # define a named capture for later convenience
(?P<parenthesized> # define the group "parenthesized" which matches a
# substring which contains correctly nested
# parentheses (it does not have to be enclosed in
# parentheses though)
[^()]* # match arbitrarily many non-parenthesis characters
(?: # start non capturing group
[(] # match a literal opening (
(?P>parenthesized) # recursively call this "parenthesized" subpattern
# i.e. make sure that the contents of these literal ()
# are also correctly parenthesized
[)] # match a literal closing )
[^()]* # match more non-parenthesis characters
)* # repeat
) # end of "parenthesized" pattern
) # end of DEFINE sequence
# Now the actual pattern begins
(?<=[(]) # ensure that there is a literal ( left of the start
# of the match
(?P>parenthesized)? # match correctly parenthesized substring
$ # ensure that we've reached the end of the input
/x # activate free-spacing mode
The gist of this pattern is obviously the parenthesized subpattern. I should maybe elaborate a bit more on that. Its structure is this:
(normal* (?:special normal*)*)
where normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling the loop". It's used to match anything that has the structure
nnnsnnsnnnnsnnsnn
where n is matched by normal and s is matched by special.
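To make the normal* (?:special normal*)* shape concrete, here is a minimal Python sketch. It is a simplification and not the recursive pattern above: Python's built-in re module has no recursion, so special is assumed here to be a single flat (...) group. The names NORMAL, SPECIAL, UNROLLED and is_flat_balanced are made up for illustration.

```python
import re

NORMAL = r"[^()]"        # any character that is not a parenthesis
SPECIAL = r"\([^()]*\)"  # assumption: one flat (...) group, no deeper nesting

# normal* (?:special normal*)*  -- the unrolled-loop shape
UNROLLED = re.compile(f"{NORMAL}*(?:{SPECIAL}{NORMAL}*)*")

def is_flat_balanced(text):
    """True if text consists only of normal characters and flat (...) groups."""
    return UNROLLED.fullmatch(text) is not None
```

With this simplified special, a string such as xxx(xxx)xxx(xxx)xxx is accepted, while aaa(bbb (unclosed) and a(b(c)) (two nesting levels) are rejected.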
In this particular case, things are a bit more complicated, though, because we are also using recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is part of). You can view the (?P>...) syntax a bit like a backreference, except that the engine does not try to match what the group ... matched, but instead applies its subpattern again.
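Also note that my pattern will not give you an empty string for correctly parenthesized input, but will fail. You could avoid the failure by leaving out the lookbehind, although a fully balanced input then matches in its entirety rather than as an empty string. For input with an unmatched parenthesis the lookbehind is not actually necessary, because the engine will always return the left-most match.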
EDIT: Judging by two of your examples, you don't actually want everything after the last unmatched parenthesis, but only everything until the first comma. You can use my result and split on , or try Bohemian's answer.
Further reading:
(?(DEFINE)...) is actually abusing another feature called conditional patterns. The PCRE man pages explain how it works - just search the pages for "Defining subpatterns for use by reference only".
EDIT: I noticed that you mentioned in your question that you are using Object Pascal. In that case you are probably not actually using PCRE, which means there is no support for recursion. In that case there can be no full regex solution to the problem. If we impose a limitation like "there can only be one more nesting level after the last unmatched parenthesis" (as in all your examples), then we can come up with a solution. Again, I'll use "unrolling the loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx.
(?<=[(]) # make sure we start after an opening (
(?= # lookahead checks that the parenthesis is not matched
[^()]*([(][^()]*[)][^()]*)*
# this matches an arbitrarily long chain of parenthesized
# substring, but allows only one nesting level
$ # make sure we can reach the end of the string like this
) # end of lookahead
[^(),]*([(][^()]*[)][^(),]*)*
# now actually match the desired part. this is the same
# as the lookahead, except we do not allow for commas
# outside of parentheses now, so that you only get the
# first comma-separated part
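As a rough check, the one-nesting-level pattern above can be tried against the examples from the question. The sketch below uses Python's re module in verbose mode purely as a test bed (an assumption; the question is about Object Pascal), and the helper name first_open_argument is made up for illustration. It returns an empty string when nothing matches, which is how the question writes those cases.

```python
import re

# The one-nesting-level pattern from the answer, unchanged apart from layout.
LAST_OPEN_ARG = re.compile(r"""
    (?<=[(])                         # start right after an opening parenthesis
    (?=                              # which must stay unmatched:
        [^()]*([(][^()]*[)][^()]*)*  #   only flat, balanced groups may follow
        $                            #   all the way to the end of the string
    )
    [^(),]*([(][^()]*[)][^(),]*)*    # the part up to the first un-nested comma
""", re.VERBOSE)

def first_open_argument(text):
    """Return the matched text, or '' when every parenthesis is closed."""
    m = LAST_OPEN_ARG.search(text)
    return m.group(0) if m else ""

if __name__ == "__main__":
    for sample in ["aaa(xxx, yyy", "aaa(bbb(x, y)", "aaa(bbb(x), ccc(x))"]:
        print(repr(sample), "->", repr(first_open_argument(sample)))
```

Because search returns the left-most match, the first opening parenthesis whose remainder is balanced (to one level) wins, which is the last unmatched one.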
If you ever add an input example like aaa(xxx(yyy()) where you want to match xxx(yyy()) then this approach will not match it. In fact, no regex that does not use recursion can handle arbitrary nesting levels.
Since your regex flavor doesn't support recursion, you are probably better off without using regex at all. Even if my last regex matches all your current input examples, it's really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudocode gives you everything after the last unmatched (:
while you can read another character from the string
    if that character is "(", push the current position onto the stack
    if that character is ")", pop a position from the stack (if it is not empty)
# you've reached the end of the string now
if the stack is empty, there is no match
else the top of the stack is the position of the last unmatched parenthesis;
    take a substring from just after that position to the end of the string
To then obtain everything up to the first unnested comma, you can walk that result again:
nestingLevel = 0
while you can read another character from the string
    if that character is "," and nestingLevel == 0, stop
    if that character is "(" increment nestingLevel
    if that character is ")" decrement nestingLevel
take a substring from the beginning of the string to the position at which
you left the loop
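The two loops above could look roughly like this in Python (a sketch only; the same logic should carry over to Object Pascal with little change). The function names are made up for illustration, and two details are assumptions on my part: a stray ")" with an empty stack is ignored, and the result starts just after the unmatched parenthesis.

```python
def after_last_unmatched_paren(text):
    """Return everything after the last unmatched '(' or None if there is none."""
    stack = []  # positions of '(' characters that are still open
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")" and stack:  # assumption: ignore a stray ')'
            stack.pop()
    if not stack:
        return None
    return text[stack[-1] + 1:]

def up_to_first_unnested_comma(text):
    """Cut text at the first comma that is not inside parentheses."""
    nesting_level = 0
    for pos, ch in enumerate(text):
        if ch == "," and nesting_level == 0:
            return text[:pos]
        if ch == "(":
            nesting_level += 1
        elif ch == ")":
            nesting_level -= 1
    return text  # no un-nested comma: keep everything

def calltip_argument(text):
    """Combine both steps; '' when every parenthesis is closed."""
    rest = after_last_unmatched_paren(text)
    return "" if rest is None else up_to_first_unnested_comma(rest)
```

Unlike the one-level regex, this version also handles deeper nesting, for example aaa(xxx(yyy()) yields xxx(yyy()).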
These two short loops will be much easier for anyone else to understand in the future and are a lot more flexible than a regex solution (at least one without recursion).