finding CDRs in NGS data

Question

I have millions of sequences in fasta format and want to extract CDRs (CDR1, CDR2 and CDR3).I chose only one sequence as an example and tried to extract CDR1 but not able to extract CDR1.

sequence:-'FYSHSAVTLDESGGGLQTPGGGLSLVCKASGFTFSSYGMMWVRQAPGKGLEYVAGIRNDA GDKRYGSAVQGRATISRDNGQSTVRLQLNNLRAEDTGTYFCAKESGCYWDSTHCIDAWGH GTEVIVSTGG'.

cdr1 starts from:- 'VCKASGFTFS', with maximum three replacements but C at 2nd place is must. cdr1 ends at:-'WVRQAP', with maximum two replacements but R at 3rd place is must.

Extracted cdr1 should be SYGMM

def cdr1_in(cdr_in): #VCKASGFTFS
    pin=0
    max_pin=3       
    
    if cdr[1]!='C':
        pin+=1
    if cdr[0]!='V':
        pin+=1
    if cdr[2]!='K':
        pin+=1
    if cdr[3]!='A':
        pin+=1    
    if cdr[4]!='S':
        pin+=1
    if cdr[5]!='G':
        pin+=1
    if cdr[6]!='F':
        pin+=1
    if cdr[7]!='T':
        pin+=1    
    if cdr[8]!='F':
        pin+=1
    if cdr[9]!='S':
        pin+=1   
  
    if pin<max_pin:
        print('CDR_in pattern', cdr_in)
        # print('CDR_starts from', arr.index(cdr_in)+9)
        return (arr.index(cdr_in)+9)
 
    def cdr1_out(cdr_out):#WVRQAP
    
        pin=0
        max_pin=2            
        if cdr[1]!='V':
            pin+=1
        if cdr[0]!='W':
            pin+=1
        if cdr[2]!='R':
            pin+=1
        if cdr[3]!='Q':
            pin+=1    
        if cdr[4]!='A':
            pin+=1
        if cdr[5]!='P':
            pin+=1
            
        if pin<max_pin:
            # print('CDR_in pattern', cdr_out)
            # print('CDR_ends at', arr.index(cdr_out))
            return (arr.index(cdr_out))
 

K=10
arr=sequence
for i in range(len(arr)-k+1):
        slider=arr[i:k+i]
        print("CDR_1 is:", arr[cdr1_in(slider): cdr1_out(slider)])

Cloudberry · Accepted Answer

Am I right to assume that you are analysing immunosequencing data and by CDR you mean complementarity determining regions from B or T cell receptors? Is the data from human or mouse? If that is the case, rather than reinventing the wheel, you might want to take a look at existing tools. I have used mixcr. Another popular tool is IMGT/HighV-QUEST but AFAIK it is only available as a web app and cannot be used for large datasets. If they do not fit your purposes, you might at least get hints on how to proceed.

finding CDRs in NGS data

Tags:

python

regex

fuzzy-search

biopython

python-re

shivam

1 Answers

Cloudberry

Recent Activity

Donate For Us

finding CDRs in NGS data

Tags:

python

regex

fuzzy-search

biopython

python-re

shivam

1 Answers

Cloudberry

Related questions

Recent Activity

Donate For Us