In R, what is the best way to find dots flanked by asterisks and replace them with asterisks?
input:
"AG**...**GG*.*.G.*C.C"
desired output:
"AG*******GG***.G.*C.C"
I tried the following function, but it is not elegant, to say the least.
library(stringr)
replac <- function(my_string) {
  # Start/end positions of every run of dots enclosed by asterisks
  m <- str_locate_all(my_string, "\\*\\.+\\*")[[1]]
  if (nrow(m) == 0) return(my_string)
  # Split into single characters so that ranges can be overwritten
  split_s <- unlist(str_split(my_string, ""))
  for (i in seq_len(nrow(m))) {
    st <- m[i, 1]
    en <- m[i, 2]
    split_s[st:en] <- rep("*", length(st:en))
  }
  paste(split_s, collapse = "")
}
You can use gsub with perl = TRUE and the \G anchor, which asserts the position at the end of the previous match.
To match the leading AG or GG, use [AG]G, or [A-Z]+ to match one or more uppercase characters.
Use * as the replacement.
(?:[A-Z]+\*+|\G(?!^))\K\.(?=[^*]*\*)
That will match:
(?:          Non-capturing group
  [A-Z]+\*+  Match 1+ uppercase chars A-Z, then 1+ times *
  |          Or
  \G(?!^)    Assert the position at the end of the previous match, not at the start of the string
)            Close non-capturing group
\K           Forget what is currently matched
\.           Match a literal dot
(?=          Positive lookahead, assert that what is on the right is
  [^*]*\*    Match 0+ times any char except *, then match *
)            Close lookahead
For example:
gsub("(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)", "*", "AG**...**GG*.*.G.*C.C", perl = TRUE)
Result
[1] "AG*******GG***.G.*C.C"
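To see what the \G branch contributes, here is a minimal sketch that runs the same input through the pattern with and without it. The variable names (x, with_G, without_G, res_with, res_without) are made up for this illustration; the first pattern is the one given above, and the second is only a comparison variant.

```r
x <- "AG**...**GG*.*.G.*C.C"

# Pattern from the answer: the \G branch lets a match continue
# exactly where the previous match ended.
with_G <- "(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)"

# Comparison variant with the \G branch removed (illustration only).
without_G <- "[A-Z]+\\*+\\K\\.(?=[^*]*\\*)"

res_with <- gsub(with_G, "*", x, perl = TRUE)
res_without <- gsub(without_G, "*", x, perl = TRUE)

print(res_with)
print(res_without)
```

Without the \G branch, only the first dot after each letters-plus-asterisks prefix should be replaced, so the run of three dots is only partly filled in.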
Try this code (it needs stringr loaded). It's still not wrapped, but at least it's a bit shorter than yours and works for all cases, not only those without other occurrences of dots in the string:
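library(stringr)
replac_v2 <- function(my_string) {
  repeat {
    # Start/end of the first remaining run of dots enclosed by asterisks
    loc <- str_locate(my_string, "\\*\\.+\\*")
    if (is.na(loc[1, "start"])) return(my_string)
    n <- loc[1, "end"] - loc[1, "start"] + 1
    # Replace that run with the same number of asterisks
    my_string <- str_replace(my_string, "\\*\\.+\\*", strrep("*", n))
  }
}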
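For comparison, the same idea can probably be written in base R without stringr, using gregexpr() together with the regmatches<-() replacement function. This is only a sketch: the function name replace_flanked_dots is hypothetical, and because it replaces non-overlapping matches in a single pass, two runs of dots that share one asterisk (as in "*.*.*") would not both be filled, unlike with the loop above.

```r
# Hypothetical helper: overwrite each "*...*" match with asterisks
# of the same length, in one pass over non-overlapping matches.
replace_flanked_dots <- function(x) {
  m <- gregexpr("\\*\\.+\\*", x)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(hit) strrep("*", nchar(hit))
  )
  x
}

out <- replace_flanked_dots("AG**...**GG*.*.G.*C.C")
print(out)
```

Strings without any match, such as "ACGT", are returned unchanged, and the function is vectorised over x.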