How to replace the wild card characters with sampled characters in R

Tags: string, r

I have the following sequence:

s0 <- "KDRH?THLA???RT?HLAK"

The wildcard character is ?. I want to replace each wildcard with a character sampled from this vector:

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

Since s0 has 5 wildcards (?), I would sample 5 characters from AADict:

set.seed(1)
nof_wildcard <- 5
tolower(sample(AADict, nof_wildcard, TRUE))

Which gives [1] "d" "q" "a" "r" "l"
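
The wildcard count does not have to be hard-coded. A minimal sketch, assuming base R only, that derives it from the string (wildcard_pos is a name introduced here for illustration):

# Locate every literal "?" in the sequence; fixed = TRUE avoids regex escaping.
s0 <- "KDRH?THLA???RT?HLAK"
wildcard_pos <- gregexpr("?", s0, fixed = TRUE)[[1]]

# gregexpr() returns -1 when nothing matches, so count positive positions only.
nof_wildcard <- sum(wildcard_pos > 0)

as.integer(wildcard_pos)
#> [1]  5 10 11 12 15
nof_wildcard
#> [1] 5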

Hence the expected result is:

     KDRH?THLA???RT?HLAK
     KDRHdTHLAqarRTlHLAK

Each sampled character must land exactly where a ? was, but the order of the sampled characters does not matter. For example, this result is also acceptable: KDRHqTHLAdlaRTrHLAK.

How can I achieve that with R?

Other examples are:

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"
asked Oct 11 '25 09:10 by scamander


2 Answers

One approach is to replace the "?" characters one at a time using a loop, e.g.:

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0
#> [1] "KDRH?THLA???RT?HLAK"
repeat {
  s0 <- sub("\\?", sample(tolower(AADict), 1), s0)  # fill the first remaining "?"
  if (!grepl("\\?", s0)) break                       # stop once none are left
}
s0
#> [1] "KDRHtTHLAidwRTyHLAK"

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
repeat {
  s1 <- sub("\\?", sample(tolower(AADict), 1), s1)  # fill the first remaining "?"
  if (!grepl("\\?", s1)) break                       # stop once none are left
}
s1
#> [1] "FKDHKHIDVKDRHRTHLAKrstaRTRHLAK"

s2 <- "FKHIDVKDRHRTRHLAK??????????"
repeat {
  s2 <- sub("\\?", sample(tolower(AADict), 1), s2)  # fill the first remaining "?"
  if (!grepl("\\?", s2)) break                       # stop once none are left
}
s2
#> [1] "FKHIDVKDRHRTRHLAKdvcfmheiqn"
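
To avoid repeating the loop for every sequence, it could be wrapped in a small helper. This is only a sketch: fill_wildcards_loop is a hypothetical name, and it uses while instead of repeat so that a string without any "?" is returned without a redundant substitution pass.

# Hypothetical helper (not part of the answer above): fill one "?" per pass.
fill_wildcards_loop <- function(x, dict) {
  # fixed = TRUE treats "?" as a literal character, so no escaping is needed.
  while (grepl("?", x, fixed = TRUE)) {
    x <- sub("?", sample(dict, 1), x, fixed = TRUE)
  }
  x
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
seqs <- c(s0 = "KDRH?THLA???RT?HLAK",
          s1 = "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
          s2 = "FKHIDVKDRHRTRHLAK??????????")

set.seed(1)  # only to make the draws repeatable
filled <- vapply(seqs, fill_wildcards_loop, character(1), dict = tolower(AADict))
filled

Each pass draws one character with sample(dict, 1), so this is equivalent to sampling with replacement.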

Another approach, which also allows sampling without replacement:

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
matches <- gregexpr("\\?", s0)
regmatches(s0, matches) <- lapply(lengths(matches), sample, x = tolower(AADict), replace = FALSE)
s0
#> [1] "KDRHdTHLAlanRTiHLAK"

Created on 2022-10-22 by the reprex package (v2.0.1)
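
One thing to keep in mind with replace = FALSE: sample() cannot draw more values than the dictionary holds, so a string with more than 20 wildcards would raise an error. A possible sketch that applies the regmatches approach to several strings at once and makes that limit explicit (fill_wildcards_unique is a hypothetical name, and the error message is my own wording):

# Hypothetical helper: fill wildcards so that no letter repeats within a string.
fill_wildcards_unique <- function(x, dict) {
  matches <- gregexpr("\\?", x)
  # Count real matches; gregexpr() reports -1 when a string has no "?".
  n <- vapply(matches, function(m) sum(m > 0), integer(1))
  if (any(n > length(dict))) {
    stop("more wildcards than dictionary entries; cannot sample without replacement")
  }
  # One independent draw per string, each without replacement.
  regmatches(x, matches) <- lapply(n, sample, x = dict, replace = FALSE)
  x
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
seqs <- c("KDRH?THLA???RT?HLAK",
          "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
          "FKHIDVKDRHRTRHLAK??????????")

filled_unique <- fill_wildcards_unique(seqs, tolower(AADict))
filled_unique

The draws are unique within each string only; the same letter can still appear in two different strings.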

answered Oct 14 '25 18:10 by jared_mamrot


You could split your string into single characters, which makes it easy to replace the wildcards without a loop (a loop was my first approach):

replace_wc <- function(x, dict) {
  x <- strsplit(x, split = "")[[1]]
  ix <- grepl("\\?", x)
  x[ix] <- sample(dict, sum(ix), replace = TRUE)

  return(paste0(x, collapse = ""))
}

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c(
  "A", "R", "N", "D", "C", "E", "Q", "G", "H",
  "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)

set.seed(1)

replace_wc(s0, tolower(AADict))
#> [1] "KDRHdTHLAqarRTlHLAK"
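
The same split-and-replace idea extends to several sequences in one call. The following is a sketch rather than part of the answer: replace_wc_all and keeps_fixed_positions are hypothetical names, and the second function only checks the requirement from the question that every non-wildcard position stays untouched.

# Hypothetical vectorised variant: fills the wildcards of every string in x.
replace_wc_all <- function(x, dict) {
  vapply(strsplit(x, split = ""), function(ch) {
    ix <- ch == "?"                                   # wildcard positions
    ch[ix] <- sample(dict, sum(ix), replace = TRUE)   # one draw per wildcard
    paste0(ch, collapse = "")
  }, character(1))
}

# Hypothetical check: same length, fixed residues unchanged, no "?" left.
keeps_fixed_positions <- function(original, filled) {
  o <- strsplit(original, "")[[1]]
  f <- strsplit(filled, "")[[1]]
  fixed <- o != "?"
  length(o) == length(f) && all(o[fixed] == f[fixed]) && !any(f == "?")
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
seqs <- c("KDRH?THLA???RT?HLAK",
          "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
          "FKHIDVKDRHRTRHLAK??????????")

filled_all <- replace_wc_all(seqs, tolower(AADict))
all(mapply(keeps_fixed_positions, seqs, filled_all))
#> [1] TRUE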
answered Oct 14 '25 17:10 by stefan