Hi
I am currently involved in making a website aimed at combining all papillomavirus information in a single place.
As part of the effort we are curating all known files on public servers (e.g. genbank)
One of the issues I ran into was that many (~50%) of all solved structures are not numbered according to the protein.
I.e. a subdomain was crystallized (amino acid 310-450) however the crystallographer deposited this as residue 1-140.
I was wondering whether anyone knows of a way to renumber the entire pdb file. I have found ways to renumber the sequence (identified by seqres), however this does not update the helix and sheet information.
I would appreciate it if you had any suggestions…
Thanks
the code just load the pdb file throught PDBParser() and loop over the pdb structure object changing id of residues starting from 1 and adding +1 on each loop, then saves the renubered_structure through PDBIO() (first set the structure then saves it).
First you can add your missing residues manually in your PDB file (e.g. if it is a histidine, you can just copy any histidine residue atoms and use the dummy coordinate) followed by refinement using ModLoop server to fix the coordinate. You can use PDBFIXER with the original information from your *. PDB file.
Open the pdb file of the dimeric protein in pymol. See the sequence for chain information. Delete the sequence (chain/chains) such that only the sequence of the monomer remains. Save the file.
The easiest way to fix the missing atoms is that you can open your PDB structure in swiss PDB viewer software and then save the structure without doing any extra action. Then you can use pdb2gmx.
I'm the maintainer of pdb-tools - which may be a tool that can assist you.
I have recently modified the residue-renumber
script within my application to provide more flexibility. It can now renumber
hetatms and specific chains, and either force the residue numbers to be continuous or just add a user-specified offset to all residues.
Please let me know if this assists you.
I frequently encounter this problem too. After abandoning an old perl script I had for this I've been experimenting with some python instead. This solution assumes you've got Biopython, ProDy (http://www.csb.pitt.edu/ProDy/#prody) and EMBOSS (http://emboss.sourceforge.net/) installed.
I used one of the papillomavirus PDB entries here.
from Bio import AlignIO,SeqIO,ExPASy,SwissProt
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Emboss.Applications import NeedleCommandline
from prody.proteins.pdbfile import parsePDB, writePDB
import os
oneletter = {
'ASP':'D','GLU':'E','ASN':'N','GLN':'Q',
'ARG':'R','LYS':'K','PRO':'P','GLY':'G',
'CYS':'C','THR':'T','SER':'S','MET':'M',
'TRP':'W','PHE':'F','TYR':'Y','HIS':'H',
'ALA':'A','VAL':'V','LEU':'L','ILE':'I',
}
# Retrieve pdb to extract sequence
# Can probably be done with Bio.PDB but being able to use the vmd-like selection algebra is nice
pdbname="2kpl"
selection="chain A"
structure=parsePDB(pdbname)
pdbseq_str=''.join([oneletter[i] for i in structure.select("protein and name CA and %s"%selection).getResnames()])
alnPDBseq=SeqRecord(Seq(pdbseq_str,IUPAC.protein),id=pdbname)
SeqIO.write(alnPDBseq,"%s.fasta"%pdbname,"fasta")
# Retrieve reference sequence
accession="Q96QZ7"
handle = ExPASy.get_sprot_raw(accession)
swissseq = SwissProt.read(handle)
refseq=SeqRecord(Seq(swissseq.sequence,IUPAC.protein),id=accession)
SeqIO.write(refseq, "%s.fasta"%accession,"fasta")
# Do global alignment with needle from EMBOSS, stores entire sequences which makes numbering easier
needle_cli = NeedleCommandline(asequence="%s.fasta"%pdbname,bsequence="%s.fasta"%accession,gapopen=10,gapextend=0.5,outfile="needle.out")
needle_cli()
aln = AlignIO.read("needle.out", "emboss")
os.remove("needle.out")
os.remove("%s.fasta"%pdbname)
os.remove("%s.fasta"%accession)
alnPDBseq = aln[0]
alnREFseq = aln[1]
# Initialize per-letter annotation for pdb sequence record
alnPDBseq.letter_annotations["resnum"]=[None]*len(alnPDBseq)
# Initialize annotation for reference sequence, assume first residue is #1
alnREFseq.letter_annotations["resnum"]=range(1,len(alnREFseq)+1)
# Set new residue numbers in alnPDBseq based on alignment
reslist = [[i,alnREFseq.letter_annotations["resnum"][i]] for i in range(len(alnREFseq)) if alnPDBseq[i] != '-']
for [i,r] in reslist:
alnPDBseq.letter_annotations["resnum"][i]=r
# Set new residue numbers in the structure
newresnums=[i for i in alnPDBseq.letter_annotations["resnum"][:] if i != None]
resindices=structure.select("protein and name CA and %s"%selection).getResindices()
resmatrix = [[newresnums[i],resindices[i]] for i in range(len(newresnums)) ]
for [newresnum,resindex] in resmatrix:
structure.select("resindex %d"%resindex).setResnums(newresnum)
writePDB("%s.renumbered.pdb"%pdbname,structure)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With