Replacing multiple occurrences of a character or string inside parentheses in R

Tags: regex, r

I am trying to replace commas within all sets of parentheses with a semicolon, but not change any commas outside of the parentheses.

So, for example:

"a, b, c (1, 2, 3), d, e (4, 5)"

should become:

"a, b, c (1; 2; 3), d, e (4; 5)"

I have started attempting this with gsub, but I am having a really hard time figuring out how to identify the commas within the parentheses.

I would call myself an advanced beginner with R, but with regular expressions and text manipulations, a total noob. Any help you can provide would be great.

asked Jul 23 '15 13:07 by scrrd


1 Answer

The simplest solution

The most common workaround, which works as long as all parentheses are balanced and not nested:

,(?=[^()]*\))

R code:

a <- "a, b, c (1, 2, 3), d, e (4, 5)"
gsub(",(?=[^()]*\\))", ";", a, perl=TRUE)
## [1] "a, b, c (1; 2; 3), d, e (4; 5)"


The regex matches...

  • , - a comma if...
  • (?=[^()]*\)) - it is followed by 0 or more characters other than ( or ) (with [^()]*) and a literal ).
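
To see where the lookahead holds up and where it does not, here is a small sketch that runs the same pattern over a few inputs. The test strings are made up for illustration; only the pattern comes from the answer above.

```r
# The pattern from above: a comma that is followed by non-parenthesis
# characters and then a closing parenthesis.
pattern <- ",(?=[^()]*\\))"

# gsub is vectorised, so several strings can be processed at once.
flat <- gsub(pattern, ";",
             c("a, b (1, 2)", "no parens, here", "f(x, y), g(z)"),
             perl = TRUE)
flat
## [1] "a, b (1; 2)"     "no parens, here" "f(x; y), g(z)"

# Nested parentheses: the comma after "1" is followed by "(" before any ")",
# so it is left untouched.
nested <- gsub(pattern, ";", "(1, (2, 3), 4)", perl = TRUE)
nested
## [1] "(1, (2; 3); 4)"

# Unbalanced input: a stray ")" makes a comma outside any group match.
unbalanced <- gsub(pattern, ";", "a, b) c", perl = TRUE)
unbalanced
## [1] "a; b) c"
```

The last two cases are why the pattern is only reliable for balanced, non-nested parentheses.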

Alternative solutions

If you need to make sure only commas inside the closest open and close parentheses are replaced, it is safer to use a gsubfn based approach:

library(gsubfn)
x <- 'a, b, c (1, 2, 3), d, e (4, 5)'
gsubfn('\\(([^()]*)\\)', function(match) gsub(',', ';', match, fixed=TRUE), x, backref=0)
## => [1] "a, b, c (1; 2; 3), d, e (4; 5)"

Here, \(([^()]*)\) matches (, then 0+ chars other than ( and ) and then ), and after that the match found is passed to the anonymous function where all , chars are replaced with semi-colons using gsub.
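
If installing gsubfn is not an option, roughly the same idea can be sketched in base R with gregexpr and the replacement form of regmatches. This is a swapped-in technique rather than part of the approach above, and the variable names are just for illustration.

```r
x <- "a, b, c (1, 2, 3), d, e (4, 5)"

# Locate every innermost "( ... )" chunk.
m <- gregexpr("\\([^()]*\\)", x)

# Replace commas only inside the matched chunks, then write them back.
fixed_chunks <- lapply(regmatches(x, m),
                       function(s) gsub(",", ";", s, fixed = TRUE))
regmatches(x, m) <- fixed_chunks
x
## [1] "a, b, c (1; 2; 3), d, e (4; 5)"
```

Like the gsubfn version, this only touches text between a "(" and the nearest ")" with no other parentheses in between.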

If you need to perform this replacement inside balanced parentheses of unknown nesting depth, use a PCRE regex with gsubfn:

x1 <- 'a, b, c (1, (2, (3, 4)), 5), d, e (4, 5)'
gsubfn('\\(((?:[^()]++|(?R))*)\\)', function(match) gsub(',', ';', match, fixed=TRUE), x1, backref=0, perl=TRUE)
## => [1] "a, b, c (1; (2; (3; 4)); 5), d, e (4; 5)"

Pattern details

\(             # Open parenthesis
  (            # Start group 1
   (?:         # Start of a non-capturing group:
     [^()]++   # Any 1 or more chars other than '(' and ')'
     |         #   OR
      (?R)     # Recursively match the entire pattern
   )*          # End of the non-capturing group and repeat it zero or more times
  )            # End of Group 1 (with backref=0, `match` receives the whole match, parentheses included)
\)             # A literal ')'
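
For comparison, nested parentheses can also be handled without a recursive regex by tracking the nesting depth character by character. The sketch below is an alternative technique, not the approach above; `replace_commas_nested` is a hypothetical helper name, and it assumes the parentheses are balanced.

```r
# Replace every comma that sits at nesting depth >= 1 with a semicolon.
replace_commas_nested <- function(s) {
  ch <- strsplit(s, "", fixed = TRUE)[[1]]        # one element per character
  depth <- cumsum((ch == "(") - (ch == ")"))      # running nesting depth
  ch[ch == "," & depth > 0] <- ";"                # commas inside any parentheses
  paste(ch, collapse = "")
}

res <- replace_commas_nested("a, b, c (1, (2, (3, 4)), 5), d, e (4, 5)")
res
## [1] "a, b, c (1; (2; (3; 4)); 5), d, e (4; 5)"
```

This gives the same result as the recursive pattern on the example string, at the cost of handling one string at a time.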
answered Sep 22 '22 14:09 by Wiktor Stribiżew