I have millions of sequences in fasta format and want to extract CDRs (CDR1, CDR2 and CDR3).I chose only one sequence as an example and tried to extract CDR1 but not able to extract CDR1.
sequence:-'FYSHSAVTLDESGGGLQTPGGGLSLVCKASGFTFSSYGMMWVRQAPGKGLEYVAGIRNDA GDKRYGSAVQGRATISRDNGQSTVRLQLNNLRAEDTGTYFCAKESGCYWDSTHCIDAWGH GTEVIVSTGG'.
cdr1 starts from:- 'VCKASGFTFS', with maximum three replacements but C at 2nd place is must. cdr1 ends at:-'WVRQAP', with maximum two replacements but R at 3rd place is must.
Extracted cdr1 should be SYGMM
def cdr1_in(cdr_in): #VCKASGFTFS
pin=0
max_pin=3
if cdr[1]!='C':
pin+=1
if cdr[0]!='V':
pin+=1
if cdr[2]!='K':
pin+=1
if cdr[3]!='A':
pin+=1
if cdr[4]!='S':
pin+=1
if cdr[5]!='G':
pin+=1
if cdr[6]!='F':
pin+=1
if cdr[7]!='T':
pin+=1
if cdr[8]!='F':
pin+=1
if cdr[9]!='S':
pin+=1
if pin<max_pin:
print('CDR_in pattern', cdr_in)
# print('CDR_starts from', arr.index(cdr_in)+9)
return (arr.index(cdr_in)+9)
def cdr1_out(cdr_out):#WVRQAP
pin=0
max_pin=2
if cdr[1]!='V':
pin+=1
if cdr[0]!='W':
pin+=1
if cdr[2]!='R':
pin+=1
if cdr[3]!='Q':
pin+=1
if cdr[4]!='A':
pin+=1
if cdr[5]!='P':
pin+=1
if pin<max_pin:
# print('CDR_in pattern', cdr_out)
# print('CDR_ends at', arr.index(cdr_out))
return (arr.index(cdr_out))
K=10
arr=sequence
for i in range(len(arr)-k+1):
slider=arr[i:k+i]
print("CDR_1 is:", arr[cdr1_in(slider): cdr1_out(slider)])
Am I right to assume that you are analysing immunosequencing data and by CDR you mean complementarity determining regions from B or T cell receptors? Is the data from human or mouse? If that is the case, rather than reinventing the wheel, you might want to take a look at existing tools. I have used mixcr. Another popular tool is IMGT/HighV-QUEST but AFAIK it is only available as a web app and cannot be used for large datasets. If they do not fit your purposes, you might at least get hints on how to proceed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With