I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k; "k" is length of each match - K-mer - and must be specified by user) and rows represent number of matches in sequence in a list. Lets say my list includes 5 members: <pre class="prettyprint"><code>DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA") </code></pre> I want set <code>k=2</code> (2-mer) so <code>4^2=16</code> combination are available including <code>AA,AT,AC,AG,TA,TT,...</code> So my table will have <code>5 rows</code> and <code>16 columns</code>. I want to count number of matches between my k-mers and list members. My desired result: <code>df:</code> <pre class="prettyprint"><code>lstMemb AA AT AC AG TA TT TC ... 1 2 1 1 0 0 3 0 2 ... 3 4 5 </code></pre> Could you help me implement this in R?

If you are looking for speed the obvious solution is <code>stringi</code> package. There is <code>stri_count_fixed</code> function for counting patterns. And now, check the code and benchmark! <pre class="prettyprint"><code>DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA") dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T")) result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE)) colnames(result) <- dna result AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT [1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3 [2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0 [3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1 [4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 [5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0 fstri <- function(x){ t(sapply(x, stri_count_fixed,dna,T)) } fbio <- function(x){ t(sapply(x, function(x){x1 <- DNAString(x); oligonucleotideFrequency(x1,2)})) } all(fstri(DNAlst)==fbio(DNAlst)) #results are the same [1] TRUE longDNA <- sample(DNAlst,100,T) microbenchmark(fstri(longDNA),fbio(longDNA)) Unit: microseconds expr min lq mean median uq max neval fstri(longDNA) 689.378 738.184 825.3014 766.862 793.134 6027.039 100 fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294 100 127245.489/766.862 ## [1] 165.9301 </code></pre> Ca 165x times faster :)

May be this helps <pre class="prettyprint"><code> source("http://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings) t(sapply(DNAlst, function(x){x1 <- DNAString(x) oligonucleotideFrequency(x1,2)})) # AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT #[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3 #[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0 #[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1 #[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 #[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0 </code></pre> Or as suggested by @Arun, convert the <code>list</code> to <code>vector</code> first <pre class="prettyprint"><code> oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L) # AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT #[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3 #[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0 #[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1 #[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 #[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0 </code></pre>

matching and counting strings (k-mer of DNA) in R

I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k; "k" is length of each match - K-mer - and must be specified by user) and rows represent number of matches in sequence in a list.

Lets say my list includes 5 members:

DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want set k=2 (2-mer) so 4^2=16 combination are available including AA,AT,AC,AG,TA,TT,...

So my table will have 5 rows and 16 columns. I want to count number of matches between my k-mers and list members.

My desired result: df:

lstMemb AA AT AC AG TA TT TC ...
  1     2  1  1  0  0  3  0
  2       ...
  3
  4
  5

Could you help me implement this in R?

How do you calculate k-mer?

A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence). Now, it is important to remember that to get all k-mers from a sequence you need to get the first k characters, then move just a single character for the start of the next k-mer and so on.

What is k-mer in sequencing?

Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where.

Which methods use k-mers?

Background. Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads.

What is K in KMER?

A kmer is just a nucleotide sequence of a certain length. For instance a dinucleotide is a kmer where k=2. When we talk about all kmers to talk about all the possible sequences of that length. So for example, when K=2 all the possible kmers are: AA AT AC AG TA TT TC TG CA CT CC CG GA GT GC GG.

If you are looking for speed the obvious solution is stringi package. There is stri_count_fixed function for counting patterns. And now, check the code and benchmark!

DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0



fstri <- function(x){
    t(sapply(x, stri_count_fixed,dna,T))
}
fbio <- function(x){
    t(sapply(x, function(x){x1 <-  DNAString(x); oligonucleotideFrequency(x1,2)}))
}

all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE

longDNA <- sample(DNAlst,100,T)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
           expr        min         lq        mean     median         uq        max neval
 fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
  fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100
127245.489/766.862
## [1] 165.9301

Ca 165x times faster :)

May be this helps

 source("http://bioconductor.org/biocLite.R")
 biocLite("Biostrings")
 library(Biostrings)
 t(sapply(DNAlst, function(x){x1 <-  DNAString(x)
                   oligonucleotideFrequency(x1,2)}))
  #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
  #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
  #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
  #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
  #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
  #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0

Or as suggested by @Arun, convert the list to vector first

   oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
   #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
   #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
   #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
   #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
   #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
   #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0

matching and counting strings (k-mer of DNA) in R

Tags:

string

regex

r

dna-sequence

Cina

People also ask

2 Answers

bartektartanus

akrun

Recent Activity

Donate For Us

matching and counting strings (k-mer of DNA) in R

Tags:

string

regex

r

dna-sequence

Cina

People also ask

2 Answers

bartektartanus

akrun

Related questions

Recent Activity

Donate For Us