Hello my dear teachers / fellow R users,
I have recently started learning regex in earnest, and I've come across a case where we would like to keep only paired parentheses () and omit the unpaired ones. Here is my sample data:
structure(list(t1 = c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)",
"Book1 (pg 2) book 3 (pg4) something)")), class = "data.frame", row.names = c(NA,
-4L))
And my desired output would be like:
structure(list(t1 = c("Book (Pg 1)", "(Website) Online", "Journal: 2018",
"Book1 (pg 2) book 3 (pg4) something")), class = "data.frame", row.names = c(NA,
-4L))
I have managed to do it with the following code, but I suspect there is a more efficient way of going about it. I would also like to learn other ways of getting the same result:
test$t2 <- gsub("([(]?.*[)]?\\s+[^(]\\w+)[)]|([(].*[)])", "\\1\\2", test$t1)
test
                                    t1                                  t2
1                          Book (Pg 1)                         Book (Pg 1)
2                    (Website) Online)                    (Website) Online
3                       Journal: 2018)                       Journal: 2018
4 Book1 (pg 2) book 3 (pg4) something) Book1 (pg 2) book 3 (pg4) something
Another issue with my regex is that when I swap the RHS and LHS of |, it no longer leads to the desired result, and I'm curious why.
I would be grateful if you could give a little bit of explanation on solving these sorts of problems.
Thank you very much in advance.
Pretty straightforward:
\([^()]*\)(*SKIP)(*FAIL)|[()]+
with the perl = TRUE parameter.
A bit of an explanation:
\([^()]*\)(*SKIP)(*FAIL) # match any balanced parenthesis construct and let the engine skip it
| # or
[()]+ # match single parentheses
Read more on backtracking control verbs and see a demo on regex101.com.
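To see what each part of the pattern contributes, here is a minimal sketch using the sample strings from the question. The variable names x, all_removed, no_skip and kept_pairs are only illustrative and not part of the original answer.

```r
x <- c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)",
       "Book1 (pg 2) book 3 (pg4) something)")

# Second branch alone: every parenthesis is removed, paired or not
all_removed <- gsub("[()]+", "", x, perl = TRUE)

# (*FAIL) without (*SKIP): after the forced failure the engine falls back
# to the second branch at the same position, so paired parentheses go too
no_skip <- gsub("\\([^()]*\\)(*FAIL)|[()]+", "", x, perl = TRUE)

# Full pattern: balanced pairs are skipped, only stray parentheses are removed
kept_pairs <- gsub("\\([^()]*\\)(*SKIP)(*FAIL)|[()]+", "", x, perl = TRUE)
print(kept_pairs)
```

Comparing no_skip with kept_pairs suggests that it is the (*SKIP) verb, not (*FAIL) on its own, that protects the balanced pairs.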
You can use
> gsub("\\([^()]*\\)(*SKIP)(*F)|[()]", "", df$t1, perl=TRUE)
[1] "Book (Pg 1)" "(Website) Online"
[3] "Journal: 2018" "Book1 (pg 2) book 3 (pg4) something"
See the R online demo and the regex demo.
Details
- \([^()]*\)(*SKIP)(*F) - a (, zero or more chars other than ( and ), and then a ) char; this matched text is discarded and the next match is searched for starting from the failure position
- | - or
- [()] - matches a ( or ) char.

If you need to skip balanced, nested parentheses, you can use
gsub("(\\((?:[^()]++|(?-1))*\\))(*SKIP)(*F)|[()]", "", df$t1, perl=TRUE)
Here, (\((?:[^()]++|(?-1))*\))(*SKIP)(*F) matches and skips any substrings in between nested parentheses (like (aa (bb(c)x)x)) and |[()] matches any ( and ) in other contexts.
See this regex demo.
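As a rough illustration of the difference between the two patterns, the sketch below runs both on two made-up strings with nested parentheses. The inputs in y and the names flat and nested are assumptions for this example, not data from the question.

```r
# Hypothetical inputs: one nested pair followed by a stray ")",
# and one stray "(" followed by a balanced pair
y <- c("a (b (c) d) e)", "((x) y")

# Non-recursive pattern: only the innermost pair counts as balanced
flat <- gsub("\\([^()]*\\)(*SKIP)(*F)|[()]", "", y, perl = TRUE)

# Recursive pattern: (?-1) re-enters group 1, so nested pairs are skipped whole
nested <- gsub("(\\((?:[^()]++|(?-1))*\\))(*SKIP)(*F)|[()]", "", y, perl = TRUE)

print(flat)    # "a b (c) d e"   "(x) y"
print(nested)  # "a (b (c) d) e" "(x) y"
```

On input without nesting the two patterns should behave the same, so the recursive version is only needed when pairs can contain other pairs.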
The (*SKIP) and (*F) (= (*FAIL)) PCRE verbs mean:
- (*SKIP) - if a later failure makes the engine backtrack onto (*SKIP), the current match attempt is abandoned and the next attempt starts at the string position that had been reached when (*SKIP) was encountered, rather than at the next character, potentially saving a lot of fruitless match attempts
- (*F) - signals failure to the regex engine, triggering backtracking (here, backtracking reaches (*SKIP), so the balanced pair is skipped and matching resumes after it). Note that (*F) is the same as (?!), an empty negative lookahead that always fails.

(\\([^)]*\\))*|\\)* regex may solve the problem
Regex explanation may be seen here
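The regex has two alternatives: the first is wrapped in a capturing group () and the second isn't.
- \\( matches a literal (
- [^)] matches any character except a literal )
- * matches the previous token repeatedly; since ) is excluded, the match runs up to the first closing )
- \\) matches a literal )
- * after the group parenthesis matches the group any number of times
- \\1 (in the replacement) refers to the text captured by the group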
test <- structure(list(t1 = c("Book (Pg 1)", "(Website) Online)", "Journal: 2018)",
"Book1 (pg 2) book 3 (pg4) something)")), class = "data.frame", row.names = c(NA,
-4L))
test
#> t1
#> 1 Book (Pg 1)
#> 2 (Website) Online)
#> 3 Journal: 2018)
#> 4 Book1 (pg 2) book 3 (pg4) something)
gsub('(\\([^)]*\\))*|\\)*', '\\1', test$t1)
#> [1] "Book (Pg 1)"
#> [2] "(Website) Online"
#> [3] "Journal: 2018"
#> [4] "Book1 (pg 2) book 3 (pg4) something"
Created on 2021-07-04 by the reprex package (v2.0.0)