
Identify continuously occurring stretch of specific letters in a string using R

Tags: r, substr, stringr

I would like to check whether each value in the string column of the data frame below contains a run of at least 5 consecutive characters, each either "V" or "G", within its first 20 characters.

Sample data:

 data = data.frame(class = c('a','b','C'), string =
 c("ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
 "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
 "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"))

For example, the string in the first row has "VVVVG" within its first 20 characters. Similarly, the string in the third row has "VVGGV".

data
#  class                                                  string
#1     a ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ
#2     b      AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD
#3     C       GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER

The desired output should look like this:

#   class                                                  string result
# 1     a ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ   TRUE
# 2     b      AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD  FALSE
# 3     C       GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER   TRUE
Veerendra Gadekar asked Jun 04 '15 14:06


2 Answers

Similar to Akrun's answer:

transform(data, result=grepl("[VG]{5,}", substr(string, 1, 20)))

Produces

  class                                                  string result
1     a ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ   TRUE
2     b      AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD  FALSE
3     C       GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER   TRUE

Here we use grepl with a character class that matches either "G" or "V" ([VG]), repeated 5 or more times ({5,}); substr limits the search to the first 20 characters. transform returns a copy of the data frame with columns added or modified.
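To see which run the pattern actually picks up, here is a minimal sketch (not part of the original answer) that extracts the first match from the first 20 characters of each sample string. The variable names head20, m, and runs are only illustrative.

```r
# Sample strings copied from the question
strings <- c(
  "ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
  "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
  "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"
)

# Keep only the first 20 characters of each string
head20 <- substr(strings, 1, 20)

# regexpr() returns the start of the first match, or -1 when there is none
m <- regexpr("[VG]{5,}", head20)

# regmatches() returns only the matched pieces, so fill them into an
# NA-initialised vector to keep one entry per input string
runs <- rep(NA_character_, length(head20))
runs[as.integer(m) > 0] <- regmatches(head20, m)

print(runs)
# "VVVVGVGGG" NA "VVGGVVVGG"
```

The second string yields NA because its run of six G's starts after position 20, which is why grepl returns FALSE for that row.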


EDIT: some benchmarks against Matthew's creative answer:

set.seed(1)
string <- vapply(
  replicate(1e5, sample(c("V", "G", "A", "S"), sample(20:300, 1), replace=TRUE)),
  paste0, character(1L), collapse=""
)
library(microbenchmark)
microbenchmark(
  grepl("[VG]{5,}", substr(string, 1, 20)),
  grepl("^.{,15}[VG]{5,}", string),
  times=10
)

Produces:

Unit: milliseconds
                                     expr      min       lq     mean
 grepl("[VG]{5,}", substr(string, 1, 20)) 131.6668 131.8343 133.6644
         grepl("^.{,15}[VG]{5,}", string) 299.7326 300.4416 302.5065

I wasn't entirely sure what to expect, but the result makes sense, since substr is very cheap to apply. The timings are very close when the run of 5 occurs near the front of the string.
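A timing comparison only means something if both expressions return the same answers. The following is a rough sanity check, not part of the original benchmark: it uses a smaller sample of 1,000 strings, a hypothetical helper make_string, and writes the anchored pattern with the explicit quantifier {0,15}, which is assumed to be equivalent to the {,15} form used above.

```r
set.seed(1)

# Hypothetical helper: one random string of length 20 to 300
# drawn from the same four letters as the benchmark
make_string <- function() {
  n <- sample(20:300, 1)
  paste0(sample(c("V", "G", "A", "S"), n, replace = TRUE), collapse = "")
}

string <- replicate(1e3, make_string())

# Approach 1: trim to the first 20 characters, then search
via_substr <- grepl("[VG]{5,}", substr(string, 1, 20))

# Approach 2: anchor at the start and allow at most 15 leading characters
via_anchor <- grepl("^.{0,15}[VG]{5,}", string)

# Both approaches should flag exactly the same strings
stopifnot(identical(via_substr, via_anchor))
```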

BrodieG answered Nov 16 '22 16:11


Another option, without substr:

within(data, result<-grepl('^.{,15}[VG]{5,}', string))
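The 15 in the pattern follows from the arithmetic: a run of 5 fits inside the first 20 characters only if it starts at position 16 or earlier, so at most 15 characters may precede it. Here is a small boundary check, offered as a sketch rather than as part of the original answer; the test strings are made up, and the quantifier is written as {0,15} on the assumption that it behaves the same as {,15}.

```r
# Hypothetical test strings (not from the original post)
fits     <- paste0(strrep("A", 15), "VGVGV")  # run occupies positions 16 to 20
too_late <- paste0(strrep("A", 16), "VGVGV")  # run occupies positions 17 to 21
x <- c(fits, too_late)

# Anchored pattern: up to 15 arbitrary characters, then the run
anchored <- grepl("^.{0,15}[VG]{5,}", x)

# substr-based check for comparison
trimmed <- grepl("[VG]{5,}", substr(x, 1, 20))

print(anchored)  # TRUE FALSE
print(trimmed)   # TRUE FALSE
```

Shifting the run by one character past position 16 flips the result for both approaches, which is the behaviour the question asks for.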
Matthew Plourde answered Nov 16 '22 16:11