Say I have a FASTA file containing 3 sequences:
ATTTTTGGA
AT
A
I want my sequence data to look like this:
ATTTTTGGA
ATNNNNNNN
ANNNNNNNN
Are there any programs or scripts that could accomplish this in a reasonable timeframe? I have thousands of sequences. Thanks!
I've been experimenting and tried the script below. The output file ended up blank, but this is as far as I have gotten.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
in_file = open(sys.argv[1],'r')
sequences = SeqIO.parse(in_file, "fasta")
output_in_file = open("test.fasta", "w")
for record in sequences:
    n = 150
    record.seq = record.seq + ("N" * n)
    seq = seq[:n]
output_in_file.close()
in_file.close()
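Your output file is blank because the script opens it but never writes any records to it. Improving your code:
import sys
from Bio import SeqIO
with open(sys.argv[1], "r") as in_file:
    # Read all records up front so the longest length is known before padding
    sequences = list(SeqIO.parse(in_file, "fasta"))
n = max(map(len, sequences))  # length of the longest sequence
for record in sequences:
    # Right-pad each sequence with N up to length n
    record.seq = record.seq + ("N" * (n - len(record)))
SeqIO.write(sequences, "test.fasta", "fasta")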
This gives, in test.fasta:
>id_1
ATTTTTGGA
>id_2
ATNNNNNNN
>id_3
ANNNNNNNN
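The padding step itself does not depend on Biopython. As a minimal sketch on plain strings (the helper name `pad_to_longest` is made up for illustration and is not part of any library):

```python
def pad_to_longest(seqs, fill="N"):
    """Right-pad every sequence with `fill` up to the length of the longest one.

    `fill` is assumed to be a single character.
    """
    if not seqs:
        return []
    width = max(len(s) for s in seqs)
    # str.ljust pads on the right and leaves full-length strings untouched
    return [s.ljust(width, fill) for s in seqs]


print(pad_to_longest(["ATTTTTGGA", "AT", "A"]))
# → ['ATTTTTGGA', 'ATNNNNNNN', 'ANNNNNNNN']
```

This needs the full list in memory to find the longest length, same as the Biopython version, which should be fine for thousands of sequences.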
To pad every sequence to a fixed length of 150 bp instead:
import sys
from Bio import SeqIO
with open(sys.argv[1], "r") as in_file:
    sequences = list(SeqIO.parse(in_file, "fasta"))
n = 150  # target length for every sequence
for record in sequences:
    # Right-pad with N up to 150; longer sequences are left unchanged
    record.seq = record.seq + ("N" * (n - len(record)))
SeqIO.write(sequences, "test.fasta", "fasta")
This gives:
>id_1
ATTTTTGGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>id_2
ATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>id_3
ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
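One caveat with the fixed-length version: "N" multiplied by a negative number is an empty string, so any sequence longer than 150 bp passes through at its original length. If you also want to cut those down (which the `seq[:n]` line in the question suggests), a rough sketch on plain strings could look like this. The helper name `fit_to_length` is hypothetical, not a Biopython function:

```python
def fit_to_length(seq, n=150, fill="N"):
    """Return `seq` as exactly `n` characters: truncated if longer, N-padded if shorter.

    `fill` is assumed to be a single character.
    """
    # Slice first so long sequences are cut, then pad so short ones are filled
    return seq[:n].ljust(n, fill)


print(fit_to_length("AT", 5))         # → ATNNN
print(fit_to_length("ATTTTTGGA", 5))  # → ATTTT
```

Inside the Biopython loop this could be applied as `record.seq = Seq(fit_to_length(str(record.seq)))`, with `Seq` imported from `Bio.Seq`. Whether truncating is acceptable depends on what you do with the sequences downstream.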