Compare every *nd symbol of a text string

Question

The problem is that I have a large text file. Let it be

 a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")

I need to compare every 3rd symbol in this text with a value (e.g. 'c') and, if it matches, add 1 to a counter i. I thought of using grep, but it seems this function wouldn't suit my purpose. So I need your help or advice.

Beyond that, I want to extract certain values from this string into a vector. For example, I want to extract symbols 4:10, e.g.

 a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")
[1] "gatcgatcga"

Thank you in advance.

P.S.

I know it's not the best idea to write the script I need in R, but I'm curious whether it's possible to write it in an adequate way.

Josh O'Brien · Accepted Answer

Edited to provide a solution that's fast for much larger strings:

If you have a very long string (on the order of millions of nucleotides), the lookbehind assertion in my original answer (below) is too slow to be practical. In that case, use something like the following, which (1) splits the string into individual characters; (2) uses the characters to fill a three-row matrix, column by column; and then (3) extracts the characters in the 3rd row of the matrix. This takes on the order of 0.2 seconds to process a 3-million-character string.

## Make a 3-million character long string
a <- paste0(sample(c("a", "t", "c", "g"), 3e6, replace=TRUE), collapse="")

## Extract the third nucleotide of each codon (triplet)
n3  <- matrix(strsplit(a, "")[[1]], nrow=3)[3,]

## Check that it works (the string is random, so exact counts vary between runs)
sum(n3=="c")
# [1] 250431
table(n3)
#  n3
#      a      c      g      t 
# 250549 250431 249008 250012
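
The matrix approach above relies on the string length being a multiple of 3; otherwise matrix() warns and recycles values to fill the last column. As a minimal sketch of one way around that, the characters can be indexed at every third position directly. The helper name every_nth is hypothetical (it is not part of the answer above), and the short string from the question is used so the result is easy to check.

```r
## Hypothetical helper: return every n-th character of a single string.
## Trailing characters that do not complete a group of n are ignored.
every_nth <- function(x, n = 3) {
  chars <- strsplit(x, "")[[1]]
  if (length(chars) < n) return(character(0))
  chars[seq(n, length(chars), by = n)]
}

a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"
n3 <- every_nth(a, 3)

## Count how many of the selected characters are 'c'
sum(n3 == "c")
# [1] 3
```

Because the positions are computed from the actual length, a string such as "atcga" simply yields its one complete third position rather than a warning.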

Original answer:

I might use substr() in both cases.

## Split into codons. (The "lookbehind assertion", "(?<=.{3})", matches at each
## inter-character location that's preceded by three characters of any type.)
codons <- strsplit(a, "(?<=.{3})", perl=TRUE)[[1]]
#  [1] "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg"

## Extract 3rd nucleotide in each codon
n3 <- sapply(codons, function(X) substr(X,3,3))
# atc gat cga tcg atc gat cga tcg atc gat cga tcg 
# "c" "t" "a" "g" "c" "t" "a" "g" "c" "t" "a" "g" 

## Count the number of 'c's
sum(n3=="c")
# [1] 3
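
Since the goal is only the character at each third position, the split into codons can also be skipped. The following is a sketch of that variant, assuming the short string from the question; it uses base R's substring(), which accepts vectors of start and end positions, and the variable name pos is introduced here for illustration.

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## Positions 3, 6, 9, ... up to the length of the string
pos <- seq(3, nchar(a), by = 3)

## Pull out the single character at each of those positions
n3 <- substring(a, pos, pos)

## Count the number of 'c's
sum(n3 == "c")
# [1] 3
```

Unlike the sapply() version, the result here is an unnamed character vector.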


## Extract nucleotides 4-10
substr(a, 4,10)
# [1] "gatcgat"
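
substr() returns the range as a single string. The question asks for the values in a vector, so here is a small sketch of one way to get one element per character, assuming the same short string; the names sub and sub_chars are illustrative only.

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## Characters 4 through 10 as one string
sub <- substr(a, 4, 10)

## The same characters as a vector of single letters
sub_chars <- strsplit(sub, "")[[1]]
sub_chars
# [1] "g" "a" "t" "c" "g" "a" "t"
```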
