I would like to identify whether the string column in the data frame below contains a run of at least 5 consecutive "V" or "G" characters (in any mix) within the first 20 characters of the string.
Sample data:
data = data.frame(class = c('a','b','C'), string =
c("ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
"AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
"GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"))
For example, the string in the first row has "VVVVG" within the first 20 character positions. Similarly, the string in the third row has "VVGGV".
data
# class string
#1 a ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ
#2 b AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD
#3 C GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER
The desired output should look like this:
# class string result
# 1 a ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ TRUE
# 2 b AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD FALSE
# 3 C GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER TRUE
Similar to Akrun's answer:
transform(data, result=grepl("[VG]{5,}", substr(string, 1, 20)))
Produces
class string result
1 a ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ TRUE
2 b AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD FALSE
3 C GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER TRUE
Here we use grepl combined with a character class that matches either "G" or "V" ([VG]) repeated 5 or more times ({5,}), applied only to the first 20 characters as extracted by substr. transform just creates a new data frame with the added or modified columns.
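To see how the character class and the substr window interact, here is a minimal sketch on a few made-up strings. The vector x is hypothetical test data, not part of the question; the third string holds a valid run that starts at position 19, so it falls outside the 20-character window.

```r
# Hypothetical test strings (not from the question's data)
x <- c(
  "AAVGVGVAAA",                     # run "VGVGV" (5 long) near the start
  "AVGVGAVGVG",                     # two runs of 4, interrupted by an "A"
  paste0(strrep("A", 18), "VGVGV")  # run of 5 starts at position 19
)

# Keep only the first 20 characters, then look for 5+ consecutive V/G
first20 <- substr(x, 1, 20)
has_run <- grepl("[VG]{5,}", first20)

print(data.frame(x, has_run))
```

Only the first string should be flagged: the second never reaches 5 in a row, and the third is cut off by substr before the run completes.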
EDIT: some benchmarks against Matthew's creative answer:
set.seed(1)
string <- vapply(
replicate(1e5, sample(c("V", "G", "A", "S"), sample(20:300, 1), replace=TRUE)),
paste0, character(1L), collapse=""
)
library(microbenchmark)
microbenchmark(
grepl("[VG]{5,}", substr(string, 1, 20)),
grepl("^.{,15}[VG]{5,}", string),
times=10
)
Produces:
Unit: milliseconds
                                     expr      min       lq     mean
 grepl("[VG]{5,}", substr(string, 1, 20)) 131.6668 131.8343 133.6644
         grepl("^.{,15}[VG]{5,}", string) 299.7326 300.4416 302.5065
I wasn't entirely sure what to expect, but I guess the result makes sense, since substr is very simple to apply. The timings are very close when the 5 repeats occur near the front of the string.
Another option, without substr:
within(data, result<-grepl('^.{,15}[VG]{5,}', string))
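The anchored pattern works because a run of 5 that ends by position 20 must begin at position 16 or earlier, so at most 15 arbitrary characters may precede it. As a rough check, the sketch below compares both approaches on the question's strings. It spells the lower bound out as {0,15}, which is my own assumption for portability across regex engines; the answer's {,15} form is left as written above.

```r
# The three strings from the question
string <- c(
  "ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
  "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
  "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"
)

# Approach 1: truncate first, then search for the run
via_substr <- grepl("[VG]{5,}", substr(string, 1, 20))

# Approach 2: anchor at the start and allow at most 15 leading characters
via_anchor <- grepl("^.{0,15}[VG]{5,}", string)

print(data.frame(via_substr, via_anchor))
```

Both columns should agree with the desired output in the question (TRUE, FALSE, TRUE).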