Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

finding CDRs in NGS data

I have millions of sequences in fasta format and want to extract CDRs (CDR1, CDR2 and CDR3).I chose only one sequence as an example and tried to extract CDR1 but not able to extract CDR1.

sequence:-'FYSHSAVTLDESGGGLQTPGGGLSLVCKASGFTFSSYGMMWVRQAPGKGLEYVAGIRNDA GDKRYGSAVQGRATISRDNGQSTVRLQLNNLRAEDTGTYFCAKESGCYWDSTHCIDAWGH GTEVIVSTGG'.

cdr1 starts from:- 'VCKASGFTFS', with maximum three replacements but C at 2nd place is must. cdr1 ends at:-'WVRQAP', with maximum two replacements but R at 3rd place is must.

Extracted cdr1 should be SYGMM

def cdr1_in(cdr_in): #VCKASGFTFS
    pin=0
    max_pin=3       
    
    if cdr[1]!='C':
        pin+=1
    if cdr[0]!='V':
        pin+=1
    if cdr[2]!='K':
        pin+=1
    if cdr[3]!='A':
        pin+=1    
    if cdr[4]!='S':
        pin+=1
    if cdr[5]!='G':
        pin+=1
    if cdr[6]!='F':
        pin+=1
    if cdr[7]!='T':
        pin+=1    
    if cdr[8]!='F':
        pin+=1
    if cdr[9]!='S':
        pin+=1   
  
    if pin<max_pin:
        print('CDR_in pattern', cdr_in)
        # print('CDR_starts from', arr.index(cdr_in)+9)
        return (arr.index(cdr_in)+9)
 
    def cdr1_out(cdr_out):#WVRQAP
    
        pin=0
        max_pin=2            
        if cdr[1]!='V':
            pin+=1
        if cdr[0]!='W':
            pin+=1
        if cdr[2]!='R':
            pin+=1
        if cdr[3]!='Q':
            pin+=1    
        if cdr[4]!='A':
            pin+=1
        if cdr[5]!='P':
            pin+=1
            
        if pin<max_pin:
            # print('CDR_in pattern', cdr_out)
            # print('CDR_ends at', arr.index(cdr_out))
            return (arr.index(cdr_out))
 

K=10
arr=sequence
for i in range(len(arr)-k+1):
        slider=arr[i:k+i]
        print("CDR_1 is:", arr[cdr1_in(slider): cdr1_out(slider)])        
          
like image 259
shivam Avatar asked Aug 31 '25 20:08

shivam


1 Answers

Am I right to assume that you are analysing immunosequencing data and by CDR you mean complementarity determining regions from B or T cell receptors? Is the data from human or mouse? If that is the case, rather than reinventing the wheel, you might want to take a look at existing tools. I have used mixcr. Another popular tool is IMGT/HighV-QUEST but AFAIK it is only available as a web app and cannot be used for large datasets. If they do not fit your purposes, you might at least get hints on how to proceed.

like image 142
Cloudberry Avatar answered Sep 04 '25 10:09

Cloudberry