(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)
The task I'm trying to achieve is to align several sequences.
I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I have had missing values introduced to them at random points.
Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe.
My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same)
Real_sequence The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
Here is an example code to reproduce the above example:
ATCG <- c("A","T","C","G")
set.seed(40)
original.seq <- sample(ATCG, 30, T)
seqS <- matrix(original.seq,200,30, T)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
number.of.changes <- sample(seq_len(number.of.changes), 1)
new.letters <- sample(letters.to.change.with , number.of.changes, T)
where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
x[where.to.change.the.letters] <- new.letters
return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)
seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))
seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
require(stringr)
# library(help=stringr)
all.seqS <- str_replace(seqS4,"-" , "")
# how do we allign this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
I understand that if all I had was a string and a pattern I would be able to use
library(Biostrings)
pairwiseAlignment(...)
But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).
Is there a known method for doing this in R?
ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or more sequences. For the alignment of two sequences please instead use our pairwise sequence alignment tools.
Summary: ClustalW is a tool for aligning multiple protein or nucleotide sequences. The alignment is achieved via three steps: pairwise alignment, guide-tree generation and progressive alignment.
You can perform multiple alignment in R with the DECIPHER package.
Following your example, it would look something like:
library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)
It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With