
Extract text in parentheses in R

Two related questions. I have vectors of text data such as

"a(b)jk(p)"  "ipq"  "e(ijkl)"

and want to easily separate them into a vector containing the text OUTSIDE the parentheses:

"ajk"  "ipq"  "e"

and a vector containing the text INSIDE the parentheses:

"bp"   ""  "ijkl"

Is there an easy way to do this? An added difficulty is that the strings can get quite long and can contain an arbitrary number of parenthesized groups. Thus, I can't simply grab the text before and after a single pair of parentheses and need a smarter solution.

user2817329 asked Mar 10 '15 02:03



2 Answers

Text outside the parentheses

> x <- c("a(b)jk(p)"  ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"  

Text inside the parentheses

> x <- c("a(b)jk(p)"  ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

The (?<=\\()[^()]*(?=\\)) part matches the characters inside a pair of parentheses, and the (*SKIP)(*F) that follows forces that match to fail while preventing the regex engine from retrying at those positions. At every other position the engine tries the alternative after the | symbol, so the dot . matches each character that was not skipped, including the parentheses themselves. Replacing all of those matched characters with an empty string leaves only the text inside the parentheses.

> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

This regex captures the characters inside each pair of parentheses in group 1, while the alternative after | (the dot) matches every remaining character one at a time. Replacing each match with \\1 keeps the captured text when a parenthesized group matched and substitutes an empty string otherwise, which gives the desired output.
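
As a cross-check on the patterns above, here is a minimal sketch that does the same job with base R's gregexpr and regmatches, collecting the parenthesized pieces per element before collapsing them. The helper name split_parens is hypothetical (it is not part of this answer or of any package), and the sketch assumes the parentheses are balanced and not nested.

```r
# Hypothetical helper: splits each string into the text outside and the
# text inside non-nested parentheses. Uses base R only.
split_parens <- function(x) {
  # Outside: delete every "(...)" group.
  outside <- gsub("\\([^()]*\\)", "", x)
  # Inside: locate every "(...)" group, one character vector per input string.
  groups <- regmatches(x, gregexpr("\\([^()]*\\)", x))
  # Strip the enclosing parentheses and paste the pieces together;
  # strings without any group yield "".
  inside <- vapply(
    groups,
    function(g) paste(gsub("^\\(|\\)$", "", g), collapse = ""),
    character(1)
  )
  list(outside = outside, inside = inside)
}

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
res <- split_parens(x)
print(res$outside)
print(res$inside)
```

Keeping the intermediate groups list around may be useful if the individual pieces ("b" and "p") are needed separately rather than collapsed.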

Avinash Raj answered Oct 11 '22 13:10


The rm_round function in the qdapRegex package I maintain was born to do this:

First we'll get and load the package via pacman:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)

## Then we can use it to remove and extract the parts you want:

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

rm_round(x)

## [1] "ajk" "ipq" "e" 

rm_round(x, extract=TRUE)

## [[1]]
## [1] "b" "p"
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] "ijkl"

To condense b and p into a single string per element, use:

sapply(rm_round(x, extract=TRUE), paste, collapse="")

## [1] "bp"   "NA"   "ijkl"
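
Note that the element without parentheses comes back as the string "NA" rather than the "" asked for in the question. One possible way around this, sketched below, is to drop the NA entries before pasting. The list is hard-coded to mirror the rm_round(x, extract=TRUE) output shown above so that the snippet runs without qdapRegex installed; the variable names are illustrative only.

```r
# List shaped like the rm_round(x, extract=TRUE) output shown above,
# hard-coded here (an assumption) so the sketch is self-contained.
extracted <- list(c("b", "p"), NA, "ijkl")

# Drop NA entries before collapsing, so "no match" becomes "" instead of "NA".
inside <- vapply(
  extracted,
  function(v) paste(v[!is.na(v)], collapse = ""),
  character(1)
)
print(inside)
```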

Tyler Rinker answered Oct 11 '22 14:10