Extract text in parentheses in R

Tags: string, text, r, vector, stringr

Two related questions. I have vectors of text data such as

"a(b)jk(p)"  "ipq"  "e(ijkl)"

and want to easily separate it into a vector containing the text OUTSIDE the parentheses:

"ajk"  "ipq"  "e"

and a vector containing the text INSIDE the parentheses:

"bp"   ""  "ijkl"

Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.


asked Mar 10 '15 02:03

user2817329

2 Answers

Text outside the parentheses

> x <- c("a(b)jk(p)"  ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
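
The pattern above removes only groups that contain no further parentheses, so with nested input such as "a(b(c)d)e" a single pass would strip just the innermost group. The question does not say whether nesting occurs; if it does, one possible approach is to repeat the replacement until nothing is left to remove. This is a minimal sketch, and strip_nested is a hypothetical helper name, not something from the answer:

```r
# Hypothetical helper: repeatedly remove innermost "(...)" groups until
# no complete group remains. Assumes parentheses are balanced.
strip_nested <- function(x) {
  pattern <- "\\([^()]*\\)"          # a "(...)" group with no parentheses inside
  while (any(grepl(pattern, x))) {   # keep going while any element still has a group
    x <- gsub(pattern, "", x)
  }
  x
}

strip_nested("a(b(c)d)e")
# [1] "ae"
strip_nested(c("a(b)jk(p)", "ipq", "e(ijkl)"))
# [1] "ajk" "ipq" "e"
```

For input without nesting the loop runs the replacement once and gives the same result as the single gsub call.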

Text inside the parentheses

> x <- c("a(b)jk(p)"  ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

The first alternative, (?<=\\()[^()]*(?=\\)), matches the characters between a pair of brackets. The (*SKIP)(*F) that follows forces that match to fail and stops the regex engine from retrying at any position inside the text it just skipped, so those characters are left alone. For every other character the engine uses the alternative after the | symbol, so the dot . matches the brackets themselves and everything outside them. Replacing all of those matched characters with an empty string leaves only the text inside the brackets. Lookarounds and (*SKIP)(*F) are PCRE features, so gsub must be called in Perl mode (the perl argument).
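
One way to see what this pattern actually deletes is to list its matches with base R's gregexpr and regmatches. These calls are not part of the answer; this is only a small inspection sketch. Every character returned is one that gsub replaces with an empty string:

```r
x <- "a(b)jk(p)"
pattern <- "(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|."

# List every match of the pattern; the bracketed text "b" and "p" is
# skipped by (*SKIP)(*F), so it never shows up here.
deleted <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
deleted
# [1] "a" "(" ")" "j" "k" "(" ")"
```

The characters missing from this list, "b" and "p", are exactly what survives the replacement.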

> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

This regex captures the characters inside each pair of brackets in group 1, while the |. alternative matches every other character one at a time without capturing anything. Replacing each match with the contents of group 1 (\\1) therefore keeps the bracketed text and replaces everything else with an empty string, which gives the desired output.
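
If both vectors are needed at once, the two steps can be wrapped in one function. The following is a minimal base R sketch that uses regmatches and gregexpr instead of the replacement tricks above; split_parens is a hypothetical name, and the code assumes the parentheses are not nested:

```r
# Hypothetical helper: split each string into the text outside and the
# text inside its parentheses. Assumes no nested parentheses.
split_parens <- function(x) {
  pattern <- "\\([^()]*\\)"                       # one "(...)" group
  groups  <- regmatches(x, gregexpr(pattern, x))  # all "(...)" groups per element
  inside  <- vapply(
    groups,
    # drop the surrounding "(" and ")" and glue the pieces together
    function(g) paste(substr(g, 2, nchar(g) - 1), collapse = ""),
    character(1)
  )
  list(outside = gsub(pattern, "", x), inside = inside)
}

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
res <- split_parens(x)
res$outside
# [1] "ajk" "ipq" "e"
res$inside
# [1] "bp"   ""     "ijkl"
```

An element without parentheses yields no groups, and pasting an empty character vector with collapse = "" gives "", which matches the output requested in the question.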


answered Oct 11 '22 13:10

Avinash Raj

The rm_round function in the qdapRegex package I maintain was born to do this:

First we'll get and load the package via pacman:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)

Then we can use it to remove and extract the parts you want:

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

rm_round(x)

## [1] "ajk" "ipq" "e" 

rm_round(x, extract=TRUE)

## [[1]]
## [1] "b" "p"
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] "ijkl"

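To collapse the extracted pieces of each element into a single string (for example "b" and "p" into "bp"), use the following; note that an element with no parentheses comes out as the string "NA" rather than "":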

sapply(rm_round(x, extract=TRUE), paste, collapse="")

## [1] "bp"   "NA"   "ijkl"
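
The question asks for "" rather than "NA" when a string has no parentheses. One possible adjustment is to drop missing values before pasting. In this sketch the list extracted is written out by hand as a stand-in for the result of rm_round(x, extract=TRUE) shown above, so it runs without qdapRegex installed; the exact type of the NA returned by the package is an assumption:

```r
# Hand-written stand-in for rm_round(x, extract = TRUE); assumed to mirror
# the printed result above.
extracted <- list(c("b", "p"), NA_character_, "ijkl")

# Drop missing values before collapsing, so "no parentheses" becomes "".
inside <- vapply(
  extracted,
  function(parts) paste(parts[!is.na(parts)], collapse = ""),
  character(1)
)
inside
# [1] "bp"   ""     "ijkl"
```

With the package loaded, the same anonymous function could be applied directly to the list returned by rm_round.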

answered Oct 11 '22 14:10

Tyler Rinker
