How to only remove single parenthesis and keep the paired ones

Question

Hello my dear teachers/ fellow R users,

I have recently started learning regex in earnest, and I've come across a case where we would like to keep only paired parentheses () and drop the unpaired ones. Here is my sample data:

structure(list(t1 = c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)", 
"Book1 (pg 2) book 3 (pg4)  something)")), class = "data.frame", row.names = c(NA, 
-4L))

And my desired output would be like:

structure(list(t1 = c("Book (Pg 1)", "(Website) Online", "Journal: 2018", 
"Book1 (pg 2) book 3 (pg4)  something")), class = "data.frame", row.names = c(NA, 
-4L))

I have managed to do it with the following code, but I suspect there is a more efficient way of going about it. I would also like to learn other ways of getting the same result:

test$t2 <- gsub("([(]?.*[)]?\\s+[^(]\\w+)[)]|([(].*[)])", "\\1\\2", test$t1)
test

                                     t1                                   t2
1                           Book (Pg 1)                          Book (Pg 1)
2                     (Website) Online)                     (Website) Online
3                        Journal: 2018)                        Journal: 2018
4 Book1 (pg 2) book 3 (pg4)  something) Book1 (pg 2) book 3 (pg4)  something

Another issue with my regex is that when I swap the left-hand and right-hand sides of |, it no longer gives the desired result, and I'm curious why. I would be grateful for a short explanation of how to approach these sorts of problems.

Thank you very much in advance.

Jan · Accepted Answer

Pretty straight-forward:

\([^()]*\)(*SKIP)(*FAIL)|[()]+

with the perl = TRUE argument.

A bit of an explanation:

\([^()]*\)(*SKIP)(*FAIL) # match any balanced parenthesis construct and let the engine skip it
|                        # or
[()]+                    # match single parentheses

Read more on backtracking control verbs and see a demo on regex101.com.
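As a minimal sketch of how this pattern could be applied to the question's sample strings: the vector name x and the variable names pattern and cleaned are illustrative assumptions, not part of the answer. The main point is that inside an R string literal every regex backslash has to be doubled.

```r
# Sample strings from the question (vector name `x` is an assumption)
x <- c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)",
       "Book1 (pg 2) book 3 (pg4)  something)")

# In an R string literal the regex \( is written "\\("
pattern <- "\\([^()]*\\)(*SKIP)(*FAIL)|[()]+"

# perl = TRUE selects the PCRE engine, which the (*SKIP)(*FAIL) verbs need
cleaned <- gsub(pattern, "", x, perl = TRUE)
print(cleaned)
#> [1] "Book (Pg 1)"                          "(Website) Online"
#> [3] "Journal: 2018"                        "Book1 (pg 2) book 3 (pg4)  something"
```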

Wiktor Stribiżew · Answer

You can use

> gsub("\\([^()]*\\)(*SKIP)(*F)|[()]", "", df$t1, perl=TRUE)
[1] "Book (Pg 1)"                          "(Website) Online"                    
[3] "Journal: 2018"                        "Book1 (pg 2) book 3 (pg4)  something"

See the R online demo and the regex demo.

Details

\([^()]*\)(*SKIP)(*F) - matches a ( char, zero or more chars other than ( and ), and then a ) char; this matched text is then discarded and the next match is searched for starting from the failure position
| - or
[()] - matches a ( or ) char.
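To see which characters the pattern actually selects for removal, one option is to extract the matches instead of replacing them. This is only a sketch: the input vector x (including the extra string with a stray opening parenthesis) and the name hits are assumptions made for illustration.

```r
# Illustrative inputs; the last one has an unpaired "(" rather than ")"
x <- c("Book (Pg 1)", "(Website) Online)", "a ( b (c)")

pattern <- "\\([^()]*\\)(*SKIP)(*F)|[()]"

# gregexpr() locates every match, regmatches() extracts the matched text
hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Paired parentheses never show up as matches; only the stray ones do
print(hits)
#> [[1]]
#> character(0)
#>
#> [[2]]
#> [1] ")"
#>
#> [[3]]
#> [1] "("
```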

If you need to skip balanced, nested parentheses, you can use

gsub("(\\((?:[^()]++|(?-1))*\\))(*SKIP)(*F)|[()]", "", df$t1, perl=TRUE)

Here, (\((?:[^()]++|(?-1))*\))(*SKIP)(*F) matches and skips any balanced, possibly nested, parenthesized substrings (like (aa (bb(c)x)x)) and |[()] matches any ( and ) in other contexts.

See this regex demo.
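A small comparison may help show why the recursive version matters. The test strings and the names nested, flat, out_nested and out_flat below are made up for illustration; the two patterns are the ones given in this answer.

```r
# Illustrative inputs with nested parentheses (not from the question)
x <- c("(aa (bb(c)x)x) y)", "((a)", "Book (Pg 1)")

# Recursive pattern: (?-1) re-enters the capturing group to match nested pairs
nested <- "(\\((?:[^()]++|(?-1))*\\))(*SKIP)(*F)|[()]"
# Flat pattern: only protects pairs that contain no other parentheses
flat <- "\\([^()]*\\)(*SKIP)(*F)|[()]"

out_nested <- gsub(nested, "", x, perl = TRUE)
out_flat <- gsub(flat, "", x, perl = TRUE)

print(out_nested)
#> [1] "(aa (bb(c)x)x) y" "(a)"              "Book (Pg 1)"
print(out_flat)
#> [1] "aa bb(c)xx y" "(a)"          "Book (Pg 1)"
```

With the flat pattern only the innermost pair (c) survives in the first string, because the outer parentheses are never matched as a pair and are removed one by one.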

The (*SKIP) and (*F) (=(*FAIL)) PCRE verbs mean:

(*SKIP) - when a later failure makes the engine backtrack onto this verb, the current match attempt is abandoned and the next attempt starts at the string position where (*SKIP) was reached, rather than one character after the previous starting point. Everything matched up to that point is skipped, potentially saving a lot of fruitless match attempts.
(*F) - signals failure to the regex engine, triggering backtracking; here that backtracking runs into (*SKIP), so the | alternative is tried only from the new starting position onwards. Note that (*F) is the same as (?!), an empty negative lookahead that always fails.
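The effect of (*SKIP) is perhaps easiest to see by running the same pattern with and without it. This is a sketch on a made-up input string; the names x, with_skip and without_skip are assumptions.

```r
# Illustrative input: one balanced pair followed by a stray ")"
x <- "Book (Pg 1) extra)"

# With (*SKIP): after (*F) fails, the engine jumps past "(Pg 1)",
# so the second alternative never gets to see those two parentheses
with_skip <- gsub("\\([^()]*\\)(*SKIP)(*F)|[()]", "", x, perl = TRUE)

# Without (*SKIP): after (*F) fails, the engine falls back to the second
# alternative at the same position, which matches the "(" on its own
without_skip <- gsub("\\([^()]*\\)(*F)|[()]", "", x, perl = TRUE)

print(with_skip)
#> [1] "Book (Pg 1) extra"
print(without_skip)
#> [1] "Book Pg 1 extra"
```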

AnilGoyal · Answer

The regex (\([^)]*\))*|\)* may solve the problem.

Regex explanation may be seen here

There are two alternatives. The first is wrapped in () and therefore captured; the second is not
\( matches a literal (
[^)] matches any character except a literal )
* repeats the previous token zero or more times. Since ) is excluded, the match runs up to the first )
\) matches a literal )
* after the group repeats the whole group any number of times
the second alternative, \)*, matches any stray ) characters
all matches are replaced by the first captured group, \1
hence, the result you want

test <- structure(list(t1 = c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)",
                              "Book1 (pg 2) book 3 (pg4)  something)")),
                  class = "data.frame", row.names = c(NA, -4L))
test
#>                                      t1
#> 1                           Book (Pg 1)
#> 2                     (Website) Online)
#> 3                        Journal: 2018)
#> 4 Book1 (pg 2) book 3 (pg4)  something)

gsub('(\\([^)]*\\))*|\\)*', '\\1', test$t1)

#> [1] "Book (Pg 1)"                         
#> [2] "(Website) Online"                    
#> [3] "Journal: 2018"                       
#> [4] "Book1 (pg 2) book 3 (pg4)  something"

Created on 2021-07-04 by the reprex package (v2.0.0)

Tags: regex, r

Anoushiravan R