The problem is that I have a large text file. Let it be
a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")
I need to compare every 3rd symbol in this text with a value (e.g. 'c') and, if it matches, add 1 to a counter i.
I thought of using grep, but it seems this function wouldn't suit my purpose.
So I need your help or advice.
More than that, I want to extract certain values from this string into a vector. For example, I want to extract symbols 4:10, e.g.
a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")
[1] "gatcgatcga"
Thank you in advance.
P.S.
I know it's not the best idea to write the script I need in R, but I'm curious whether it's possible to write it in an adequate way.
Edited to provide a solution that's fast for much larger strings:
If you have a very long string (on the order of millions of nucleotides), the lookbehind assertion in my original answer (below) is too slow to be practical. In that case, use something more like the following, which: (1) splits the string into individual characters; (2) uses the characters to fill a three-row matrix column by column, so that each column holds one triplet; and then (3) extracts the characters in the 3rd row of the matrix. This takes on the order of 0.2 seconds to process a 3-million-character string.
## Make a 3-million character long string
a <- paste0(sample(c("a", "t", "c", "g"), 3e6, replace=TRUE), collapse="")
## Extract the third nucleotide of each codon (triplet)
n3 <- matrix(strsplit(a, "")[[1]], nrow=3)[3,]
## Check that it works (exact counts will differ from run to run, since the string is random)
sum(n3=="c")
# [1] 250431
table(n3)
# n3
# a c g t
# 250549 250431 249008 250012
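To see why the matrix trick works, it may help to run it on the short string from the question. This is only an illustrative sketch; the names `s`, `chars`, `m` and `n3_short` are mine and not part of the answer. Because `matrix()` fills column by column, each column ends up holding one triplet, and row 3 holds every third character.

```r
## Short string from the question (36 characters, a multiple of 3)
s <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## Split into single characters
chars <- strsplit(s, "")[[1]]

## matrix() fills column-wise, so each column is one triplet
m <- matrix(chars, nrow = 3)
m[, 1:4]
#      [,1] [,2] [,3] [,4]
# [1,] "a"  "g"  "c"  "t"
# [2,] "t"  "a"  "g"  "c"
# [3,] "c"  "t"  "a"  "g"

## Row 3 is therefore every third character of the string
n3_short <- m[3, ]
sum(n3_short == "c")
# [1] 3
```

Note that this relies on the string length being a multiple of 3; otherwise `matrix()` warns and recycles characters to fill the last column.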
Original answer:
I might use substr() in both cases.
## Split into codons. (The "lookbehind assertion", "(?<=.{3})" matches at each
## inter-character location that's preceded by three characters of any type.)
codons <- strsplit(a, "(?<=.{3})", perl=TRUE)[[1]]
# [1] "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg"
## Extract 3rd nucleotide in each codon
n3 <- sapply(codons, function(X) substr(X,3,3))
# atc gat cga tcg atc gat cga tcg atc gat cga tcg
# "c" "t" "a" "g" "c" "t" "a" "g" "c" "t" "a" "g"
## Count the number of 'c's
sum(n3=="c")
# [1] 3
## Extract nucleotides 4-10
substr(a, 4,10)
# [1] "gatcgat"
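If the string length is not guaranteed to be a multiple of 3, one possible variation is to index the split characters directly instead of going through codons. The sketch below is an assumption of mine rather than part of the answer: `every_kth()` is a hypothetical helper name, and it simply drops any incomplete trailing triplet.

```r
## Hypothetical helper (not from the answer above): return every k-th character
every_kth <- function(x, k = 3) {
  chars <- strsplit(x, "")[[1]]
  ## Guard: seq() would fail if the string is shorter than k
  if (length(chars) < k) return(character(0))
  chars[seq(k, length(chars), by = k)]
}

s <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

## Count the 'c's among every third character
sum(every_kth(s) == "c")
# [1] 3

## Works when the length is not a multiple of 3
every_kth("atcga")
# [1] "c"

## The question's expected output has ten characters, which corresponds to
## ten characters starting at position 4 rather than positions 4 to 10
substr(s, 4, 13)
# [1] "gatcgatcga"
```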