I'm trying to read a FASTA file, find a specific motif (a string pattern) in it, and print out the sequence and the number of times the motif occurs. A FASTA file is just a series of sequences (strings). Each record starts with a header line, marked by a leading ">", and the sequence of letters follows on a new line immediately after the header. I'm not done with the code, but so far I have this, and it gives me this error:
AttributeError: 'str' object has no attribute 'next'
I'm not sure what's wrong here.
import re
header=""
counts=0
newline=""
f1=open('fpprotein_fasta(2).txt','r')
f2=open('motifs.xls','w')
for line in f1:
    if line.startswith('>'):
        header=line
        #print header
        nextline=line.next()
        for i in nextline:
            motif="ML[A-Z][A-Z][IV]R"
            if re.findall(motif,nextline):
                counts+=1
                #print (header+'\t'+counts+'\t'+motif+'\n')
        fout.write(header+'\t'+counts+'\t'+motif+'\n')
f1.close()
f2.close()
SeqIO provides a simple uniform interface to input and output assorted sequence file formats (including multiple sequence alignments), but will only deal with sequences as SeqRecord objects. There is a sister interface, Bio.AlignIO, for working directly with sequence alignment files as Alignment objects.
The error is likely coming from the line:
nextline=line.next()
Here, line is the string you have already read, so there is no next() method on it.
Part of the problem is that you're trying to mix two different ways of reading the file: iterating over the lines with for line in f1, and calling <handle>.next().
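To illustrate the point about not mixing the two styles, here is a minimal sketch in which the for loop and the next() call share one iterator, so asking for the following line is legal. The function name count_after_headers and the sample data are made up for illustration, and the sketch assumes each sequence sits on a single line right after its header.

```python
import io
import re

MOTIF = re.compile(r"ML[A-Z][A-Z][IV]R")

def count_after_headers(handle):
    """Return (header, count) pairs, reading the line after each header."""
    results = []
    lines = iter(handle)  # one iterator shared by the loop and next()
    for line in lines:
        if line.startswith(">"):
            header = line.strip()
            # next() advances the same iterator; the default "" avoids
            # StopIteration when a header is the last line of the input.
            sequence = next(lines, "").strip()
            results.append((header, len(MOTIF.findall(sequence))))
    return results

# Hypothetical two-record input standing in for a real FASTA file.
sample = io.StringIO(">seq1\nAAMLKKIRAA\n>seq2\nGGGG\n")
print(count_after_headers(sample))
```

The key difference from the question's code is that next() is called on the iterator over the file, not on the string line.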
Also, if you are working with FASTA files, I recommend using Biopython: it makes working with collections of sequences much easier. Chapter 14 on motifs will be of particular interest to you. This will likely require that you learn more about Python in order to achieve what you want, but if you're going to be doing a lot more bioinformatics than your example here shows, it's definitely worth the investment of time.
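As a rough idea of what the Biopython route could look like, the sketch below uses Bio.SeqIO.parse to iterate over the records and counts regex matches in each one. It assumes Biopython is installed (it is a third-party package), and the function name count_motif_per_record and the sample records are hypothetical. It uses a plain regex instead of the Bio.motifs machinery that the motifs chapter covers.

```python
import io
import re

try:
    from Bio import SeqIO  # third-party package: pip install biopython
except ImportError:
    SeqIO = None  # lets the sketch load even where Biopython is missing

MOTIF = re.compile(r"ML[A-Z][A-Z][IV]R")

def count_motif_per_record(handle):
    """Map each record id to its number of non-overlapping motif matches."""
    counts = {}
    # SeqIO.parse yields one SeqRecord per FASTA entry and joins
    # sequences that are wrapped over several lines.
    for record in SeqIO.parse(handle, "fasta"):
        counts[record.id] = len(MOTIF.findall(str(record.seq)))
    return counts

if SeqIO is not None:
    # Hypothetical input; a real file name can be passed to SeqIO.parse too.
    sample = io.StringIO(">seq1 demo\nAAMLKKIR\nMLAAVRGG\n>seq2\nGGGG\n")
    print(count_motif_per_record(sample))
```

Note that record.id is only the first word of the header line. The full header text is available as record.description.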
This might help point you in the right direction:
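import re

def parse(fasta, outfile):
    motif = "ML[A-Z][A-Z][IV]R"
    header = None
    count = 0
    with open(fasta, 'r') as fin, open(outfile, 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                # A new record starts: write out the previous one first.
                if header is not None:
                    fout.write(header + '\t' + str(count) + '\t' + motif + '\n')
                # Strip the trailing newline so each output row stays on one line.
                header = line.rstrip()
                count = 0
            else:
                # Sequence line: add its non-overlapping motif matches.
                matches = re.findall(motif, line)
                count += len(matches)
        # Write the last record, which no later header triggers.
        if header is not None:
            fout.write(header + '\t' + str(count) + '\t' + motif + '\n')

if __name__ == '__main__':
    parse("fpprotein_fasta(2).txt", "motifs.xls")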
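One caveat with counting line by line: FASTA sequences are often wrapped over several lines, and a motif that straddles a line break would then be missed. A possible variation, sketched here with a hypothetical read_fasta helper and made-up sample data, joins the lines of each record before searching.

```python
import io
import re

MOTIF = re.compile(r"ML[A-Z][A-Z][IV]R")

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # Emit the record collected so far before starting a new one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    # Emit the final record, which no later header triggers.
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical input: the motif MLKKIR is split across two lines in seq1.
sample = io.StringIO(">seq1\nAAML\nKKIRAA\n>seq2\nGGGG\n")
for header, sequence in read_fasta(sample):
    print(header, len(MOTIF.findall(sequence)), sep="\t")
```

With the sample above, seq1 is counted once even though the motif is broken over two lines, whereas a per-line search would find nothing.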