separate the abnormal reads of DNA (A,T,C,G) templates

Question

I have millions of DNA clone reads, and a few of them are misreads or contain errors. I want to keep only the clean reads.

For readers without a biology background:

A DNA clone consists of only four characters (A, T, C, G) in various permutations and combinations. Any character, symbol, or sign other than "A", "T", "C", or "G" in the DNA is an error. Is there a fast, high-throughput way in Python to separate out the clean reads?

Basically, I want a way to pick out the strings that contain nothing but the characters "A", "T", "C", and "G".

Edit
correct_read_clone: "ATCGGTTCATCGAATCCGGGACTACGTAGCA"

misread_clone: "ATCGGNATCGACGTACGTACGTTTAAAGCAGG" or "ATCGGTT@CATCGAATCCGGGACTACGTAGCA" or "ATCGGTTCATCGAA*TCCGGGACTACGTAGCA" or "AT?CGGTTCATCGAATCCGGGACTACGTAGCA" etc

I have tried the for loop below:

check_list=['A','T','C','G']
for i in clone:
    if i not in check_list:
        continue

The problem with this loop is that it iterates over the string and checks the characters one by one, which makes the process slow. When cleaning millions of clones, this delay is very significant.

Jamie Dormaar · Accepted Answer

If these are the nucleotide sequences, two of which contain an error,

a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
test = a, b, c, d, e, f, g

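try this:

valid_bases = set('ATCG')
misread_counter = 0
correct_read_clone = []
for clone in test:
    # A read is clean only if every character is one of A, T, C, G.
    # Counting distinct characters is not enough: 'ATCN' has only four.
    if set(clone) <= valid_bases:
        correct_read_clone.append(clone)
    else:
        misread_counter += 1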

print(f'Unclean sequences: {misread_counter}')
print(correct_read_clone)

Output:

Unclean sequences: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']

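This way the Python-level for loop only visits each full sequence once. The per-character work still happens, but inside set(), which in CPython runs in C and is much cheaper than an explicit Python loop over every character of every sequence.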
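Since the question is also tagged python-re, here is a minimal sketch of the same idea using a precompiled regular expression. The names CLEAN_READ and is_clean are made up for this illustration, and whether this is faster than the set-based check on your data is something you would need to time yourself.

```python
import re

# Matches a read made up entirely of A, T, C, G (at least one base).
CLEAN_READ = re.compile(r"[ATCG]+")

def is_clean(read):
    """Return True if the whole read consists only of A, T, C, G."""
    return CLEAN_READ.fullmatch(read) is not None

reads = [
    "ATCGGTTCATCGAATCCGGGACTACGTAGCA",
    "ATCGGNATCGACGTACGTACGTTTAAAGCAGG",
    "ATCGGTT@CATCGAATCCGGGACTACGTAGCA",
    "ATCN",
]

# Keep only the reads that pass the check.
clean_reads = [r for r in reads if is_clean(r)]
print(clean_reads)  # ['ATCGGTTCATCGAATCCGGGACTACGTAGCA']
```

Note that fullmatch has to cover the entire string, so a single stray character anywhere in the read is enough to reject it, and an empty string is rejected as well.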

Or, if you want to know which ones have the errors, you can build two lists:

misread_clone = []
correct_read_clone = []
valid_bases = set('ATCG')
for clone in test:
    # Clean reads contain no character outside A, T, C, G.
    if set(clone) <= valid_bases:
        correct_read_clone.append(clone)
    else:
        misread_clone.append(clone)

print(f'misread sequences count: {len(misread_clone)}')
print(correct_read_clone)

Output:

misread sequences count: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']
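With millions of reads you may not want to hold every sequence in a list at once. The following is only a sketch, assuming the reads arrive one per line from a file-like object; iter_clean_reads and VALID_BASES are hypothetical names, and io.StringIO stands in for a real open file.

```python
import io

VALID_BASES = frozenset("ATCG")

def iter_clean_reads(lines):
    """Yield reads (one per line) that contain only A, T, C, G."""
    for line in lines:
        read = line.strip()
        # issuperset walks the characters of the read internally,
        # so there is no explicit Python loop per character.
        if read and VALID_BASES.issuperset(read):
            yield read

# Stand-in for an open file with one read per line (assumed layout).
fake_file = io.StringIO("ATCGGTTC\nATCGGNAT\n\nAT?CGG\nGGCATA\n")

clean = list(iter_clean_reads(fake_file))
print(clean)  # ['ATCGGTTC', 'GGCATA']
```

Because the function is a generator, the clean reads can be written out or processed one at a time instead of being collected into a list first.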

Tags: python, bioinformatics, dna-sequence, biopython, python-re

shivam
