I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k; "k" is length of each match - K-mer - and must be specified by user) and rows represent number of matches in sequence in a list.
Lets say my list includes 5 members:
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
I want set k=2
(2-mer) so 4^2=16
combination are available including AA,AT,AC,AG,TA,TT,...
So my table will have 5 rows
and 16 columns
. I want to count number of matches between my k-mers and list members.
My desired result: df:
lstMemb AA AT AC AG TA TT TC ...
1 2 1 1 0 0 3 0
2 ...
3
4
5
Could you help me implement this in R?
A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence). Now, it is important to remember that to get all k-mers from a sequence you need to get the first k characters, then move just a single character for the start of the next k-mer and so on.
Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where.
Background. Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads.
A kmer is just a nucleotide sequence of a certain length. For instance a dinucleotide is a kmer where k=2. When we talk about all kmers to talk about all the possible sequences of that length. So for example, when K=2 all the possible kmers are: AA AT AC AG TA TT TC TG CA CT CC CG GA GT GC GG.
If you are looking for speed the obvious solution is stringi
package.
There is stri_count_fixed
function for counting patterns.
And now, check the code and benchmark!
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
fstri <- function(x){
t(sapply(x, stri_count_fixed,dna,T))
}
fbio <- function(x){
t(sapply(x, function(x){x1 <- DNAString(x); oligonucleotideFrequency(x1,2)}))
}
all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE
longDNA <- sample(DNAlst,100,T)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
expr min lq mean median uq max neval
fstri(longDNA) 689.378 738.184 825.3014 766.862 793.134 6027.039 100
fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294 100
127245.489/766.862
## [1] 165.9301
Ca 165x times faster :)
May be this helps
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
t(sapply(DNAlst, function(x){x1 <- DNAString(x)
oligonucleotideFrequency(x1,2)}))
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
#[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
#[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
#[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
#[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
Or as suggested by @Arun, convert the list
to vector
first
oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
#[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
#[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
#[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
#[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With