Handling complex parentheses structures to get the expected data

Question

We have data from a REST API call stored in an output file that looks as follows:

Sample Input File:

test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))

Expected Output:

bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh

Conditions to Derive the Expected Output:

Always consider the last occurrence of parentheses () in each line. We need to extract the values within this last, outermost pair of parentheses.
Inside the last occurrence of (), extract all values that appear before each occurrence of nested parentheses ().
For example, in test test123 - test (bla bla1 (On chutti)) the last parenthesized group runs from (bla to chutti)), so I need bla bla1, since it comes before the inner (On chutti). In other words: find the last outermost pair of parentheses, then take the text that precedes each nested pair inside it. In the line testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan)), the values needed are Ranbir kapoor and Milkha Singh.

Attempted Regex: I tried the following regular expression (see the Working Demo of regex):

Regex:

^(?:^[^(]+\([^)]+\) \(([^(]+)\([^)]+\)\))|[^(]+\(([^(]+)\([^)]+\),\s([^\(]+)\([^)]+\)\s\([^\)]+\)\)|(?:(?:.*?)\((.*?)\(.*?\)\))|(?:[^(]+\(([^)]+)\))$

The regex I tried works, but I would like to improve it with advice from the experts here.

Preferred Languages: I am looking to improve this regex, but a Python or awk answer is also fine. I will try to add an awk answer myself as well.

Ed Morton · Accepted Answer

Any time you're considering using a lengthy and/or complicated regexp to try to solve a problem, keep in mind the quote:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Using any awk:

$ cat tst.awk
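{
    rec = $0    # working copy in which matched parentheses get masked
    # find the leftmost (...) pair that has no parentheses inside it
    while ( match(rec, /\([^()]*)/) ) {
        # take the contents from $0 so nested ( and ) are retained
        tgt = substr($0,RSTART+1,RLENGTH-2)
        # swap ( and ) for RS in rec; the length, and so the offsets, stay the same
        rec = substr(rec,1,RSTART-1) RS substr(rec,RSTART+1,RLENGTH-2) RS substr(rec,RSTART+RLENGTH)
    }
    # drop every (...) left in tgt, plus any surrounding blanks
    gsub(/ *\([^()]*) */, "", tgt)
    print tgt
}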

$ awk -f tst.awk file
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh

I'm saving a copy of $0 in rec and then in the loop I'm converting every (foo) inside rec to \nfoo\n (assuming the default RS and that the RS cannot be present in a RS-separated record) and also saving the foo from $0 (to retain the possibly nested original ( and ) pairs) in the variable tgt. So when the loop ends tgt contains the last foo substring that was present in this input record, e.g. Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan). Then with the final gsub() I remove all (...) substrings from tgt, including any surrounding blanks, leaving just the desired output.
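Since the question also allows Python, here is a rough port of the same idea, offered as a sketch rather than as part of the original answer. The names last_group_text and INNERMOST are made up for illustration, and the code assumes each line has already had its trailing newline removed, so "\n" can serve as the one-character placeholder that RS plays in the awk script. Note that Python's re module needs the closing parenthesis escaped.

```python
import re

# A (...) pair that contains no further parentheses.
INNERMOST = re.compile(r"\([^()]*\)")


def last_group_text(line: str, placeholder: str = "\n") -> str:
    """Return the text of the last outermost (...) group, minus nested groups.

    `placeholder` must be a single character that cannot occur in `line`,
    so that offsets in the masked copy still line up with the original.
    """
    rec = line  # working copy whose parentheses get masked
    tgt = ""    # contents of the most recently matched group, read from `line`
    while (m := INNERMOST.search(rec)):
        start, end = m.span()
        # Read from the original line so nested parentheses are retained.
        tgt = line[start + 1:end - 1]
        # Mask "(" and ")" in the copy; the length stays the same.
        rec = (rec[:start] + placeholder + rec[start + 1:end - 1]
               + placeholder + rec[end:])
    # Remove the nested (...) groups and the blanks around them.
    return re.sub(r" *\([^()]*\) *", "", tgt)


if __name__ == "__main__":
    print(last_group_text("test test123 - test (bla bla1 (On chutti))"))
```

The walrus operator needs Python 3.8 or later. A line without any parentheses yields an empty string here, which mirrors the awk script printing an empty tgt.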

If you can ever have more levels of parenthesised strings remaining in tgt than just 1 level deep, just change gsub(/ *\([^()]*) */, "", tgt) to while ( gsub(/ *\([^()]*) */, "", tgt) );.
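The same "repeat until nothing changes" idea can be sketched in Python with re.subn, which returns the number of substitutions made. This is an illustration only; strip_all_groups is a hypothetical helper name, and the nested sample string is invented to show more than one level.

```python
import re

# A (...) pair with no parentheses inside, plus surrounding blanks.
NESTED = re.compile(r" *\([^()]*\) *")


def strip_all_groups(text: str) -> str:
    """Remove (...) groups innermost-first until none are left."""
    count = 1
    while count:
        # subn returns (new_text, number_of_replacements)
        text, count = NESTED.subn("", text)
    return text


if __name__ == "__main__":
    print(strip_all_groups("Milkha Singh (On chutti (again)) (Lagaan)"))
```

Each pass can only remove groups that have no parentheses left inside them, so a group nested two levels deep takes two passes.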

anubhava · Answer

Purely based on your shown input and your comments reflecting that you need to capture 1 or 2 values per line, here is an optimized regex solution:

^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?

RegEx Demo

RegEx Details:

This regex solution does the following:

match everything before last (...) then match ( then
1st group: match name that must not have ( and ) then
optional match of (...) or comma/space then
2nd group: match name that must not have ( and )

Further Details:

^: Start
(?:: Start non-capture group
- \([^\n)(]*\): Match any pair of (...) text
- |: OR
- [^()\n]: Match any character that is not (, ) or \n
)*: End non-capture group. Repeat this 0 or more times
\(: Match last (
([^)(\n]+): 1st capture group that matches text with 1+ characters that are not (, ) and \n
(?:: Start non-capture group 1
- \([^\n)(]*\): Match any pair of (...) text
- [, ]*: Match 0 or more of space or comma characters
- (?:: Start non-capture group 2
  - ([^)(\n]+): 2nd capture group that matches text with 1+ characters that are not (, ) and \n
- )?: End non-capture group 2. ? makes this an optional match
)?: End non-capture group 1. ? makes this an optional match
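For anyone who wants to try this regex from Python, a minimal sketch follows. It is not part of the original answer: extract_names and PATTERN are illustrative names, the pattern is the one shown above split across lines for commenting, and the captured groups are stripped because group 1 can include a trailing space. As the answer says, it captures at most two values per line.

```python
import re

PATTERN = re.compile(
    r"^(?:\([^)(]*\)|[^()])*"  # balanced (...) pairs or non-parenthesis characters
    r"\(([^)(]+)"              # the last outermost "(" then group 1: first name
    r"(?:\([^)(]*\)[, ]*"      # optional nested (...) followed by commas/spaces
    r"(?:([^)(]+))?)?"         # optional group 2: second name
)


def extract_names(line: str) -> str:
    """Return the one or two captured names joined by ', ' (empty if no match)."""
    m = PATTERN.match(line)
    if not m:
        return ""
    names = [g.strip() for g in m.groups() if g]
    return ", ".join(n for n in names if n)


if __name__ == "__main__":
    line = ("testagain blae kaun hai ye banda "
            "(Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))")
    print(extract_names(line))
```

Matching one line at a time with PATTERN.match means the \n exclusions described in the details are not needed here; they matter when the regex is run against a multi-line buffer.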

Tags: python, regex, awk

RavinderSingh13
