My script below is counting the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' from a standard FASTA file:
>contig00001
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT
The script counts the CCCCAAAA sequence here 3 times
CCCCAAAACCCCAAAACCCCAAAA(CCCC not counted)
Can somebody please advise how I would include the CCCC sequence at the end as a half count to return a value of 3.5 for this.
I've been unsuccessful in my attempts so far.
My script is as follows...
from Bio import SeqIO
input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv','w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')
for cur_record in SeqIO.parse(input_file, "fasta") :
contig = cur_record.name
CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
CCCC_count = cur_record.seq.count('CCCC')
GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
GGGG_count = cur_record.seq.count('GGGG')
#length = len(cur_record.seq)
splittedContig1=contig.split(CCCCAAAA_count)
splittedContig2=contig.split(GGGGTTTT_count)
cnt1=len(splittedContig1)-1
cnt2=len(splittedContig2)
cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count
output_line = '%s\t%i\t%i\n' % \
(CONTIG, CCCCAAAA_count, GGGGTTTT_count)
output_file.write(output_line)
output_file.close()
input_file.close()
You can use split and startwith list comprehension as follows:
contig="CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase="CCCCAAAA"
halfBase="CCCC"
splittedContig=contig.split(splitbase)
cnt=len(splittedContig)-1
print cnt+sum([0.5 for e in splittedContig if e.startswith(halfBase)])
Output:
3.5
CCCCAAAA. It would give the list, in the list elements CCCCAAAA will be removedCCCCAAAACCCC. If found add 0.5 to count for each occurence.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With