
How to find substrings flanked by a specific character and replace with text of the same length in R?

Tags: regex, r

In R, what is the best way to find dots flanked by asterisks and replace them with asterisks?

input:

"AG**...**GG*.*.G.*C.C"

desired output:

"AG*******GG***.G.*C.C"

I tried the following function, but it is not elegant to say the least.

    library(stringr)

    replac <- function(my_string) {

        m <- str_locate_all(my_string, "\\*\\.+\\*")[[1]]

        if (nrow(m) == 0) return(my_string)

        split_s <- unlist(str_split(my_string, "")) 

        for (i in 1:nrow(m)) {
            st <- m[i, 1]
            en <- m[i, 2] 
            split_s[st:en] <- rep("*", length(st:en))
        }

        paste(split_s, collapse = "")
    }
  • I have edited the input string and expected output after The fourth bird's answer below to make clear that dots not flanked by asterisks should not be changed, and that letters other than "A" and "G" may occur.
Vitor asked Dec 12 '25 14:12


2 Answers

You might use gsub with perl = TRUE and make use of the \G anchor to assert the position at the end of the previous match.

You could match AG or GG using a character class [AG]G or [A-Z]+ to match 1+ uppercase characters.

In the replacement use *

(?:[A-Z]+\*+|\G(?!^))\K\.(?=[^*]*\*)

The pattern matches:

  • (?: Non-capturing group
    • [A-Z]+\*+ Match 1+ uppercase chars A-Z, then 1+ times *
    • | Or
    • \G(?!^) Assert the position at the end of the previous match, not at the start of the string
  • ) Close the non-capturing group
  • \K Forget what has been matched so far
  • \. Match a literal dot
  • (?= Positive lookahead, assert that what is on the right is
    • [^*]*\* Match 0+ times any char except *, then match *
  • ) Close the lookahead


For example:

    gsub("(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)", "*", "AG**...**GG*.*.G.*C.C", perl = TRUE)

Result

    [1] "AG*******GG***.G.*C.C"
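
If the dots should only be replaced when nothing but dots separates them from the closing asterisk, the lookahead can be tightened. The lookahead `[^*]*\*` in the pattern above also accepts a dot such as the one in "A*.G*", where other characters sit between the dot and the next asterisk. The following is a minimal sketch of that variation, not a drop-in replacement: the helper name `fill_flanked_dots` is made up for illustration, and the pattern assumes the opening flank is a single `*` regardless of what precedes it.

```r
# Hypothetical helper; the pattern is a variation on the one above.
fill_flanked_dots <- function(x) {
  # (?:\*|\G(?!^))  start right after an asterisk, or where the previous match ended
  # \K              drop what was matched so far from the reported match
  # \.(?=\.*\*)     one dot, followed only by more dots and then an asterisk
  gsub("(?:\\*|\\G(?!^))\\K\\.(?=\\.*\\*)", "*", x, perl = TRUE)
}

fill_flanked_dots("AG**...**GG*.*.G.*C.C")
# [1] "AG*******GG***.G.*C.C"
```

Because each match is a single dot, the replacement keeps the string length unchanged.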
The fourth bird answered Dec 15 '25 05:12


Try this code. It is still not polished, but it is a bit shorter than yours and works for all cases, not only strings without other occurrences of dots:

    library(stringr)

    replac_v2 <- function(my_string) {
        b <- my_string  # just a shorter name
        pattern <- "\\*\\.+\\*"
        while (TRUE) {
            # Start and end of the first remaining "*...*" run (NA if none)
            loc <- str_locate(b, pattern)
            if (is.na(loc[1, "start"])) return(b)
            # Replace the run with the same number of asterisks
            width <- loc[1, "end"] - loc[1, "start"] + 1
            b <- str_replace(b, pattern, strrep("*", width))
        }
    }
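
The same idea can probably be written without the loop in base R. This is a sketch under a few assumptions: the function name `fill_dot_runs` is hypothetical, `strrep()` requires R 3.3.0 or later, and lookarounds are used so that the flanking asterisks are not consumed by the match.

```r
# Hypothetical helper: replace each run of dots that sits directly
# between two asterisks with a run of asterisks of the same length.
fill_dot_runs <- function(x) {
  # Lookbehind/lookahead check the flanking asterisks without consuming
  # them, so one asterisk can close a run and open the next one.
  m <- gregexpr("(?<=\\*)\\.+(?=\\*)", x, perl = TRUE)
  # For every matched run, build a same-length string of asterisks.
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(runs) strrep("*", nchar(runs)))
  x
}

fill_dot_runs("AG**...**GG*.*.G.*C.C")
# [1] "AG*******GG***.G.*C.C"
```

Since `gregexpr()` and `regmatches<-` are vectorised, this version also accepts a character vector, not only a single string.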
Ghost answered Dec 15 '25 03:12


