I have set of EST sequences in fasta file. Here, how to subset based on sequence ID or name?
>gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Putative phosphoribosyltransferase/phosphoribosylanthranilate-like gene from Morus indica, mRNA sequence
GCAGCCGTCGGATCGTGAGCGTGATCGCGTGGCTAGTCGGGTTGGCGAAATGGTTGGATGATATCCGGAG
GTGGAGGAACCCCATTACCACGGTATTGGTCCACATCTTATATTTAGTGCTTGTTTGGTACCCGGATTTG
ATTGTCCCAACCGGGTTTTTATATGTGTTCCTAATCGGTGTATGGTACTATCGGTTTCGGCCCAAGATAC
CAGCGGGTATGGATACCCGACTCTCACAAGCTGAAGCGGTTGACCCGGATGAGCTTGATGAGGAATTCGA
CACCATACCGAGCTCAAAACCACCCGACATAATCAGGGTCCGGTATGACCGGTTGCGGATATTGGCAGCC
CGGGTTCAAACGGTTTTGGGTGATTTTGCAACACAAGGGGAGCGGGTTCAGGCCTTGGTTAGCTGGAGGG
ACCCAAGGGCCACAAAATTGTTCATAGGCGTGTGCTTGGCCATAACAATAATTCTCTATGTGGTGCCACC
CAAAATGGTTGCCGTGGCACTTGGATTCTACTATTTACGACACCCCATGTTCCGAGACCCCATGCCTCCT
GCAAGCTTGAATTTCTTCAGAAGGCTTCCAAGCCTTTCAGACCGCTTTAATGTAGATTAGAATATTATAT
GATTATTAGTAGGCCCAA
>gi|296783887|gb|GW992814.1|GW992814 UAS-Mi9 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Dehydration-responsive protein RD22, Similar to BURP domain-containing protein like gene from Morus indica, mRNA sequence
AAGCAGTGGTCTAGAACCAGAGTGGCCCCTGCGATGCAGGTATCATCTCTATTATCAAAAGGGATAAGGG
GTGGATCCGTCGGGGATTTGAGTCTCACATGGTCGCTGATAACTTATTGAATGGATATTGGATTGTGTGC
AGTGCGACCTAAACAGGATTGCCGTTGGGGCCTGTGGTCAGAGATACCCCACACTTCTCAACTCCCAAAT
TGGATCTTGTTCCTTGTTTTCCTGTATTAAGCCTGACCCCTGAGGCTTTCGCCACTGCCAACTGGGTGCC
GCCTGCTGACTTCTGATTCCCCGTGCTAACGGTTACTCCCGATTCCTTATCCACATCGAAGATGAACTAT
TGACTTCCGCAAACTCAAAAGGCTGCAAGATATCACTGACCGCTGTCGGGATCCGCGATCGGCATATACG
CGAAATCCGATCCCGGATCCCGGGACTGCAGACGGCTGAA
Like using header line >gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Putative phosphoribosyltransferase/phosphoribosylanthranilate-like gene from Morus indica, mRNA sequence or using only >gi|296783888
How to do this in R?
For a slightly more heavy-weight solution, if this fits in to the Bioconductor work flow,
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
to install the Biostrings and Rsamtools package, then
library(Rsamtools)
indexFa("foo.fasta") # create an index of file 'foo.fasta'
fa = FaFile("foo.fasta") # reference the fasta file and it's index
You can discover the coordinates (names and start / end) of each sequence with
gr = as(seqinfo(fa), "GRanges")
and query for arbitrary sequences and ranges within sequences by choosing appropriate subsets, e.g., the second sequence and then first sequence in your example
getSeq(fa, gr[2:1])
or by looking up the coordinates by partial match to the names
idx = pmatch("gi|296783888", names(gr)) ## NA's if duplicates or not unique
seq = getSeq(fa, gr[idx])
"seq" is a DNAStringSet, and can be manipulated in many ways; see the vignettes available in the package
vignette(package="Biostrings")
especially the Quick Overview. To save the object to a fasta file 'file.fa' in a directory 'some' relative to the current working directory, use
writeXStringSet(seq, "some/file.fa")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With