Counting DNA sequences with python/biopython

Question

My script below is counting the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' from a standard FASTA file:

>contig00001  
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT

The script counts the CCCCAAAA sequence here 3 times

CCCCAAAACCCCAAAACCCCAAAA(CCCC not counted)

Can somebody please advise how I would include the CCCC sequence at the end as a half count to return a value of 3.5 for this.

I've been unsuccessful in my attempts so far.

My script is as follows...

from Bio import SeqIO

input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv','w')
output_file.write('Contig	CCCCAAAA	GGGGTTTT
')

for cur_record in SeqIO.parse(input_file, "fasta") :


    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')

    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)

    splittedContig1=contig.split(CCCCAAAA_count)

    splittedContig2=contig.split(GGGGTTTT_count)

    cnt1=len(splittedContig1)-1
    cnt2=len(splittedContig2)

  cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
  cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count

    output_line = '%s	%i	%i
' % \
    (CONTIG, CCCCAAAA_count, GGGGTTTT_count)


    output_file.write(output_line)

output_file.close()

input_file.close()

venpa · Accepted Answer

You can use split and startwith list comprehension as follows:

contig="CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase="CCCCAAAA"
halfBase="CCCC"
splittedContig=contig.split(splitbase)
cnt=len(splittedContig)-1
print cnt+sum([0.5 for e in splittedContig if e.startswith(halfBase)])

Output:

3.5

split the strings based on CCCCAAAA. It would give the list, in the list elements CCCCAAAA will be removed
length of splitted - 1 gives the number of occurrence of CCCCAAAA
in the splitted element, look for elements starts with CCCC. If found add 0.5 to count for each occurence.

Counting DNA sequences with python/biopython

Tags:

python

bioinformatics

dna-sequence

biopython

sheaph

1 Answers

venpa

Recent Activity

Donate For Us

Counting DNA sequences with python/biopython

Tags:

python

bioinformatics

dna-sequence

biopython

sheaph

1 Answers

venpa

Related questions

Recent Activity

Donate For Us