Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting DNA sequences with python/biopython

My script below is counting the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' from a standard FASTA file:

>contig00001  
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT

The script counts the CCCCAAAA sequence here 3 times

CCCCAAAACCCCAAAACCCCAAAA(CCCC not counted)

Can somebody please advise how I would include the CCCC sequence at the end as a half count to return a value of 3.5 for this.

I've been unsuccessful in my attempts so far.

My script is as follows...

from Bio import SeqIO

input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv','w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')

for cur_record in SeqIO.parse(input_file, "fasta") :


    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')

    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)

    splittedContig1=contig.split(CCCCAAAA_count)

    splittedContig2=contig.split(GGGGTTTT_count)

    cnt1=len(splittedContig1)-1
    cnt2=len(splittedContig2)

  cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
  cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count

    output_line = '%s\t%i\t%i\n' % \
    (CONTIG, CCCCAAAA_count, GGGGTTTT_count)


    output_file.write(output_line)

output_file.close()

input_file.close() 
like image 944
sheaph Avatar asked May 21 '26 17:05

sheaph


1 Answers

You can use split and startwith list comprehension as follows:

contig="CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase="CCCCAAAA"
halfBase="CCCC"
splittedContig=contig.split(splitbase)
cnt=len(splittedContig)-1
print cnt+sum([0.5 for e in splittedContig if e.startswith(halfBase)])

Output:

3.5
  1. split the strings based on CCCCAAAA. It would give the list, in the list elements CCCCAAAA will be removed
  2. length of splitted - 1 gives the number of occurrence of CCCCAAAA
  3. in the splitted element, look for elements starts with CCCC. If found add 0.5 to count for each occurence.
like image 123
venpa Avatar answered May 26 '26 10:05

venpa



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!