
Separate the abnormal reads of DNA (A,T,C,G) templates

I have millions of DNA clone reads, and a few of them are misreads or errors. I want to separate out only the clean reads.

For readers without a biology background:

A DNA clone consists of only four characters (A, T, C, G) in various permutations/combinations. Any character, symbol, or sign other than "A", "T", "C", and "G" in DNA is an error. Is there a fast, high-throughput way in Python to separate out only the clean reads?

Basically, I want a way to pick out the strings that contain nothing but the characters "A", "T", "C", and "G".

Edit
correct_read_clone: "ATCGGTTCATCGAATCCGGGACTACGTAGCA"

misread_clone: "ATCGGNATCGACGTACGTACGTTTAAAGCAGG" or "ATCGGTT@CATCGAATCCGGGACTACGTAGCA" or "ATCGGTTCATCGAA*TCCGGGACTACGTAGCA" or "AT?CGGTTCATCGAATCCGGGACTACGTAGCA" etc

I have tried the for loop below:

check_list=['A','T','C','G']
for i in clone:
    if i not in check_list:
        continue

The problem with this for loop is that it iterates over the string and checks the characters one by one, which makes the process slow. When cleaning millions of clones, this delay is very significant.

shivam asked Jan 25 '26 14:01


1 Answer

If these are the nucleotide sequences, with errors in two of them (the S in b and the U in c),

a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
test = a, b, c, d, e, f, g

then you can try:

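valid_bases = set('ATCG')
misread_counter = 0
correct_read_clone = []
for clone in test:
    # A read is clean only if every character is one of A, T, C, G.
    # (Counting distinct characters alone would let a read like 'AAN' through.)
    if set(clone) <= valid_bases:
        correct_read_clone.append(clone)
    else:
        misread_counter += 1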

print(f'Unclean sequences: {misread_counter}')
print(correct_read_clone)

Output:

Unclean sequences: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']

This way the Python-level for loop runs once per sequence in the list of clones. The per-character work still happens, but inside set(), which runs in C, rather than in an explicit Python loop over each character of every sequence.
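Another way to keep the per-character work out of the Python-level loop is to delete the valid bases with str.translate and check that nothing is left. This is only a sketch: the name is_clean is made up for illustration, and it assumes reads are uppercase and that an empty string should not count as a clean read.

```python
# Translation table that deletes the four valid bases.
# Assumption: reads are uppercase; lowercase bases would be treated as errors.
_DELETE_BASES = str.maketrans("", "", "ACGT")

def is_clean(read: str) -> bool:
    """Return True if the read is non-empty and contains only A, C, G, T."""
    # After deleting every valid base, a clean read leaves nothing behind.
    return bool(read) and not read.translate(_DELETE_BASES)

reads = ["ATCGGTTCATCG", "ATCGGNATCG", "ATCGGTT@CATCG", "AAN"]
clean = [r for r in reads if is_clean(r)]
print(clean)  # ['ATCGGTTCATCG']
```

Whether this is faster than the set-based check on real data is worth timing on a sample of your own reads before committing to it.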

Or, if you want to know which reads contain the errors, you can build two lists:

valid_bases = set('ATCG')
misread_clone = []
correct_read_clone = []
for clone in test:
    # Route each read to the list that matches its status.
    if set(clone) <= valid_bases:
        correct_read_clone.append(clone)
    else:
        misread_clone.append(clone)

print(f'misread sequences count: {len(misread_clone)}')
print(correct_read_clone)

Output:

misread sequences count: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']
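With millions of reads, it may not be necessary to hold every read in a list at once. The sketch below filters reads lazily with a generator and a precompiled regular expression. The name iter_clean_reads and the one-read-per-line input layout are assumptions made for illustration, not something stated in the question.

```python
import re
from typing import Iterable, Iterator

# Matches a read made up entirely of uppercase A, C, G, T (at least one base).
_CLEAN_READ = re.compile(r"[ACGT]+")

def iter_clean_reads(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the clean reads, one at a time, without building a big list."""
    for line in lines:
        read = line.strip()  # drop the trailing newline, if any
        if _CLEAN_READ.fullmatch(read):
            yield read

# Any iterable of strings works here; an open file object would be typical.
sample = ["ATCGGTTCATCG\n", "ATCGGNATCG\n", "AT?CGGTT\n", "GGCA\n"]
print(list(iter_clean_reads(sample)))  # ['ATCGGTTCATCG', 'GGCA']
```

Because the function is a generator, the clean reads can be written straight to an output file as they are found.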
Jamie Dormaar answered Jan 27 '26 02:01
