
How to only remove single parenthesis and keep the paired ones

Tags: regex, r

Hello my dear teachers/ fellow R users,

I have recently started learning regex in earnest, and I've come across a case where we would like to keep only paired parentheses () and omit the unpaired ones. Here is my sample data:

test <- structure(list(t1 = c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)", 
"Book1 (pg 2) book 3 (pg4)  something)")), class = "data.frame", row.names = c(NA, 
-4L))

And my desired output would be like:

structure(list(t1 = c("Book (Pg 1)", "(Website) Online", "Journal: 2018", 
"Book1 (pg 2) book 3 (pg4)  something")), class = "data.frame", row.names = c(NA, 
-4L))

I have managed to do it with the following code, but I thought there must be a more efficient way of going about it. In fact, I would like to learn other ways of getting the same result:

test$t2 <- gsub("([(]?.*[)]?\\s+[^(]\\w+)[)]|([(].*[)])", "\\1\\2", test$t1)
test

                                     t1                                   t2
1                           Book (Pg 1)                          Book (Pg 1)
2                     (Website) Online)                     (Website) Online
3                        Journal: 2018)                        Journal: 2018
4 Book1 (pg 2) book 3 (pg4)  something) Book1 (pg 2) book 3 (pg4)  something

Another issue with my regex is that when I swap the RHS and LHS of |, it does not lead to the desired result, and I'm curious why. I would be grateful if you could give a little explanation of how to solve these sorts of problems.

Thank you very much in advance.

Anoushiravan R asked Jul 03 '21 20:07


3 Answers

Pretty straight-forward:

\([^()]*\)(*SKIP)(*FAIL)|[()]+

with the perl = TRUE argument.


A bit of an explanation:

\([^()]*\)(*SKIP)(*FAIL) # match any balanced parenthesis construct and let the engine skip it
|                        # or
[()]+                    # match single parentheses

Read more on backtracking control verbs and see a demo on regex101.com.
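
To see what the pattern above actually selects, here is a minimal sketch in R that runs it on the sample strings from the question. The variable names (x, pattern, stray, cleaned) are made up for illustration; regmatches() with gregexpr() lists the matched pieces and gsub() removes them.

```r
# Sample strings from the question
x <- c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)",
       "Book1 (pg 2) book 3 (pg4)  something)")

# Balanced "(...)" runs are matched and skipped; only stray parentheses remain
pattern <- "\\([^()]*\\)(*SKIP)(*FAIL)|[()]+"

# List what the pattern matches in each string (PCRE verbs need perl = TRUE)
stray <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Remove those matches
cleaned <- gsub(pattern, "", x, perl = TRUE)
print(cleaned)
```

The first string has no stray parenthesis, so its entry in stray is empty, while each of the other three contributes a single ")".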

Jan answered Oct 04 '22 12:10


You can use

> gsub("\\([^()]*\\)(*SKIP)(*F)|[()]", "", df$t1, perl=TRUE)
[1] "Book (Pg 1)"                          "(Website) Online"                    
[3] "Journal: 2018"                        "Book1 (pg 2) book 3 (pg4)  something"

See the R online demo and the regex demo.

Details

  • \([^()]*\)(*SKIP)(*F) - (, zero or more chars other than ( and ) and then a ) char, and this matched text is discarded and the next match is searched for starting from the failure position
  • | - or
  • [()] - matches a ( or ) char.

If you need to skip balanced, nested parentheses, you can use

gsub("(\\((?:[^()]++|(?-1))*\\))(*SKIP)(*F)|[()]", "", df$t1, perl=TRUE)

Here, (\((?:[^()]++|(?-1))*\))(*SKIP)(*F) matches and skips any substrings in between nested parentheses (like (aa (bb(c)x)x)) and |[()] matches any ( and ) in other contexts.

See this regex demo.
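
As a rough comparison of the two patterns, the sketch below applies both to one string with nested parentheses and a stray closing one. The input string and the variable names are hypothetical, chosen only to make the difference visible.

```r
# Hypothetical input: one nested pair plus a stray ")" at the end
nested <- "a (b (c) d) e)"

# Pattern for flat pairs only, and the recursive variant for nested pairs
flat_pattern      <- "\\([^()]*\\)(*SKIP)(*F)|[()]"
recursive_pattern <- "(\\((?:[^()]++|(?-1))*\\))(*SKIP)(*F)|[()]"

# The flat pattern only protects the innermost "(c)"
flat_result <- gsub(flat_pattern, "", nested, perl = TRUE)

# The recursive pattern protects the whole "(b (c) d)" group
recursive_result <- gsub(recursive_pattern, "", nested, perl = TRUE)

print(flat_result)       # "a b (c) d e"
print(recursive_result)  # "a (b (c) d) e"
```

With the flat pattern the outer pair is treated as two unpaired parentheses and removed, so the recursive form is the one to reach for when nesting can occur.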

The (*SKIP) and (*F) (=(*FAIL)) PCRE verbs mean:

  • (*SKIP) - when backtracking reaches this verb, the current match attempt fails and the engine starts the next attempt at the string position where (*SKIP) was encountered, rather than one character after the previous start, potentially saving a lot of fruitless match attempts
  • (*F) - forces a failure at that point in the pattern, which triggers backtracking; here the backtracking runs into (*SKIP), so the balanced parentheses are passed over and the | alternative is only tried further along the string. Note that (*F) is the same as (?!), an empty negative lookahead, which always fails.
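
The role of each verb can be checked with a small experiment. This is only a sketch: the input string and the variable names are made up, and the three calls differ only in the verbs used.

```r
# Hypothetical input: one balanced pair and one stray ")"
x <- "Book (Pg 1) extra)"

# (*SKIP)(*F): the balanced pair is passed over, the stray ")" is removed
with_skip <- gsub("\\([^()]*\\)(*SKIP)(*F)|[()]", "", x, perl = TRUE)

# (?!) in place of (*F): same behaviour, since both always fail
with_lookahead <- gsub("\\([^()]*\\)(*SKIP)(?!)|[()]", "", x, perl = TRUE)

# Without (*SKIP): after the forced failure the engine retries at the same
# position with the second alternative, so every parenthesis is removed
without_skip <- gsub("\\([^()]*\\)(*F)|[()]", "", x, perl = TRUE)

print(with_skip)     # "Book (Pg 1) extra"
print(without_skip)  # "Book Pg 1 extra"
```

The last call shows why (*F) alone is not enough: it is (*SKIP) that keeps the engine from falling back to [()] inside the balanced pair.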

Wiktor Stribiżew answered Oct 04 '22 11:10


The regex (\\([^)]*\\))*|\\)* may solve the problem.

Regex explanation may be seen here

  • There are two alternatives. The first is captured and the second is not, which is why the first is wrapped in () and the second isn't
  • \\( matches a literal (
  • [^)] matches any character except a literal )
  • * repeats the previous token zero or more times. Since ) is excluded, the match runs up to the first closing )
  • \\) matches a literal )
  • * after the group repeats the whole group any number of times
  • the second alternative matches any run of )
  • every match is replaced by the first captured group, \\1
  • hence, the result you want

test <- structure(list(t1 = c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)", 
                              "Book1 (pg 2) book 3 (pg4)  something)")), class = "data.frame", row.names = c(NA, 
                                                                                                             -4L))
test
#>                                      t1
#> 1                           Book (Pg 1)
#> 2                     (Website) Online)
#> 3                        Journal: 2018)
#> 4 Book1 (pg 2) book 3 (pg4)  something)

gsub('(\\([^)]*\\))*|\\)*', '\\1', test$t1)

#> [1] "Book (Pg 1)"                         
#> [2] "(Website) Online"                    
#> [3] "Journal: 2018"                       
#> [4] "Book1 (pg 2) book 3 (pg4)  something"

Created on 2021-07-04 by the reprex package (v2.0.0)
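
The same keep-what-is-captured idea can also be written so that the pattern never matches an empty string and a stray opening parenthesis is dropped as well. This is a minimal sketch, not part of the reprex above: the fourth input string and the variable names are made up, and perl = TRUE is an assumption made here.

```r
# Three strings from the question plus a hypothetical one with a stray "("
x <- c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)", "Intro (note")

# Group 1 captures a balanced pair; a lone "(" or ")" matches the second
# alternative, leaving group 1 empty, so "\\1" replaces it with nothing
kept <- gsub("(\\([^()]*\\))|[()]", "\\1", x, perl = TRUE)

print(kept)
```

Balanced pairs are put back unchanged through \\1, while the stray ")" in the second and third strings and the stray "(" in the fourth are removed.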

AnilGoyal answered Oct 04 '22 10:10