
String manipulation in R: remove specific pattern in multiple places without removing text in between instances of the pattern

Tags: string, regex, r

In R, I am trying to write code that works on any variation of a string pattern. An example string is:

string <- "y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g"

I would like to remove ONLY the parenthesized portions that contain a "|" (that is, the pattern "(", "|", ")"), such as:

(b | c) and (1 | f)

and be left with:

"y ~ 1 + a + (d^2) + e + g"

Please note that the characters could change (e.g., 'b' could become '1' and 'c' could become 'predictor') and the code should still work. Spaces are not required either: the string could be "y~1+a+(b|c)+(d^2)+e+(1|f)+g" or any mix of spaces and no spaces. The order of the terms could also change, e.g. "y~1+a+(b|c)+e+(1|f)+(d^2)+g".

I have tried the base R string functions gsub and sub with several regular expressions aimed at the "(", "|", ")" pattern, such as:

"\\(.*\\|.*\\)"
"\\(.*\\|"
"\\(.+\\|.+\\)"
"\\|.+\\)"

as well as many of the stringr functions, to find this pattern and replace it with an empty string. However, with both base R and stringr the match removes far more than intended. For example:

gsub("\\(.*\\|.*\\)", "", string)

produces:

"y ~ 1 + a +  + g"

and

gsub("\\(.*\\|", "", string)

produces:

"y ~ 1 + a +  f) + g"

I have also tried the str_locate functions, but I cannot use them efficiently because there are multiple sets of parentheses and I only want the locations of those with a "|" between them.

Any help is greatly appreciated.

Laura Jamison asked Jan 25 '23 23:01


1 Answer

1) gsubfn. Define a function that returns an empty string if its input contains a | and the trimmed input otherwise, then pass it to gsubfn. gsubfn is like gsub except that the replacement can be a function: it receives each match as input, and its return value replaces that match. Here the pattern matches every "+ (...)" term, so each parenthesized term is handed to the function in turn.

library(gsubfn)

pick <- function(x) if (grepl("|", x, fixed = TRUE)) "" else trimws(x)
gsubfn("[+] *[(].*?[)]", pick, string, perl = TRUE)
## [1] "y ~ 1 + a  + (d^2) + e  + g"
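
For comparison, the same idea can be sketched with gsub alone. The attempts in the question over-match because `.*` is greedy and can run across several sets of parentheses; using the negated class `[^()]*` instead keeps each match inside a single pair. This is only a sketch and not part of the original answer: the helper name drop_bar_terms is made up for illustration, and it assumes that parentheses are not nested and that every term containing a | is preceded by a +.

```r
# Hypothetical helper (not from the original answer): remove every
# "+ ( ... | ... )" term with a single regular expression.
drop_bar_terms <- function(x) {
  # " *[+] *"  optional spaces, a literal +, optional spaces
  # "[(]"      opening parenthesis
  # "[^()]*"   anything except parentheses, so the match stays inside one pair
  # "[|]"      a literal |
  # "[)]"      closing parenthesis
  gsub(" *[+] *[(][^()]*[|][^()]*[)]", "", x)
}

string <- "y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g"
drop_bar_terms(string)
## [1] "y ~ 1 + a + (d^2) + e + g"

drop_bar_terms("y~1+a+(b|c)+e+(1|f)+(d^2)+g")
## [1] "y~1+a+e+(d^2)+g"
```

Because the leading spaces and the + are removed together with the term, no doubled spaces are left behind. A term with a | that comes directly after the ~ has no leading + and would not be matched by this pattern.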

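2) Base R. Split the input into terms at ~ and +, and use grep to keep only the terms that do not contain |. Then put what is left back together using reformulate. Note that the result is a formula object rather than a character string.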

s <- trimws(grep("\\|", strsplit(string, "[~+]")[[1]], invert = TRUE, value = TRUE))
reformulate(format(s[-1]), s[1])
## y ~ 1 + a + (d^2) + e + g
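
Since the question asks for a string, the formula from (2) may need converting back. A minimal sketch, assuming the no-space form of the input from the question; the variable names parts, kept, f and result are chosen here for illustration:

```r
string <- "y~1+a+(b|c)+(d^2)+e+(1|f)+g"

# Split on ~ and +, trim whitespace, and keep the pieces without a |
parts <- trimws(strsplit(string, "[~+]")[[1]])
kept <- grep("|", parts, fixed = TRUE, invert = TRUE, value = TRUE)

# The first piece is the response, the rest are right-hand-side terms
f <- reformulate(kept[-1], response = kept[1])

# deparse() turns the formula back into text; paste() guards against
# deparse() splitting a long formula over several elements
result <- paste(deparse(f), collapse = " ")
result
## [1] "y ~ 1 + a + (d^2) + e + g"
```

The spacing of the result comes from deparse, not from the input, so the output is normalized whether or not the original string had spaces. Splitting on + also assumes that no + appears inside parentheses; a term such as (1 + x | g) would be cut in two.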

3) getTerms. This also uses only base R, but first parses the string into an expression representing a formula and splits its right-hand side into terms using getTerms, a function from the SO post "Terms of a sum in a R expression". It then proceeds as in (2).

p <- parse(text = string)[[1]]
s <- grep("\\|", getTerms(p[[3]]), value = TRUE, invert = TRUE)
reformulate(s, p[[2]])
## y ~ 1 + a + (d^2) + e + g
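
getTerms is not defined in this answer, so the code in (3) does not run on its own. The sketch below supplies an illustrative stand-in written for this page; the definition in the linked post may differ. It assumes the right-hand side is a chain of binary + calls and deparses each term explicitly before calling grep.

```r
# Illustrative stand-in for getTerms (the linked post's version may differ):
# walk down the left-nested chain of binary + calls and collect the operands.
getTerms <- function(e, acc = list()) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    # e is `lhs + rhs`: store rhs, then keep descending into lhs
    getTerms(e[[2]], c(list(e[[3]]), acc))
  } else {
    # leftmost operand: prepend it and stop
    c(list(e), acc)
  }
}

string <- "y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g"
p <- parse(text = string)[[1]]

# Deparse each term to text so that grep can look for a |
labels <- vapply(getTerms(p[[3]]), deparse, character(1))
s <- grep("|", labels, fixed = TRUE, value = TRUE, invert = TRUE)
reformulate(s, p[[2]])
## y ~ 1 + a + (d^2) + e + g
```

Because the string is parsed rather than split on characters, a + inside parentheses, as in (1 + x | g), stays within a single term.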
G. Grothendieck answered Jan 29 '23 09:01