
Parsing a .pdb file in Python

Tags:

python

I'm trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb

If you scroll down far enough you will get to a line that looks like this:

ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N

The first element, "ATOM", means the line describes an actual atom in the protein. The first 1 is a general running count, N is the type of atom, "MET" is the name of the residue, "A" identifies the chain, the second "1" is what I'll call the atom count (strictly speaking, the residue number), and the next 3 numbers are the x-y-z positions in space.

What I need as output is something like this (where the "1" below corresponds to the atom count, not the general count):

MET A 1 63.645 97.355 31.526

To make matters more complicated, sometimes the atom count (the second "1" in this case) is negative. In those cases I want to skip that line and continue until I hit a positive entry, as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get lines such as:

ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C

While they both refer to residue 21, the biochem isn't precise enough to get an exact position, so they give two options. Ideally, I would specify "1", "2" or whatever, but if I just take the first option that would be ok. Finally, for the type of atom ("N") in my original example, I only want to get those lines with a "CA".
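The requirements above could be handled with a plain line-by-line loop. What follows is only a minimal sketch, assuming the file uses the standard fixed-column PDB layout (atom name in columns 13-16, alternate-location flag in column 17, residue name in 18-20, chain in 22, residue number in 23-26, coordinates in 31-54). The function name `parse_ca_atoms` is made up for illustration. Slicing by column rather than splitting on whitespace matters here because fields can run together, as in "AILE" above.

```python
def parse_ca_atoms(lines):
    """Yield (resname, chain, resseq, x, y, z) for each C-alpha atom.

    Assumes standard fixed-column ATOM records. Residues with a negative
    residue number are skipped, and when a residue has several alternate
    locations only the first one encountered is kept.
    """
    seen = set()  # (chain, residue number, insertion code) already emitted
    for line in lines:
        if not line.startswith("ATOM  "):
            continue  # ignores HETATM, REMARK, TER, ...
        if line[12:16].strip() != "CA":
            continue  # keep C-alpha atoms only
        chain = line[21]
        resseq = int(line[22:26])
        if resseq < 0:
            continue  # skip negatively numbered residues
        key = (chain, resseq, line[26])
        if key in seen:
            continue  # a later alternate location of the same residue
        seen.add(key)
        yield (
            line[17:20].strip(),
            chain,
            resseq,
            float(line[30:38]),
            float(line[38:46]),
            float(line[46:54]),
        )
```

Because the function only iterates over lines, an open file can be passed directly, for example `for row in parse_ca_atoms(open("3GFT.pdb")): print(*row)`. For a file with several models, the `seen` set means only the first occurrence of each residue is reported.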

I'm new to Python and my training is in biostats, so I was wondering what the best way to do this is. Do I parse this line by line with a for loop? Is there a way to do it faster in Python? How do I handle the double entries for some atoms?

I realize it's a bit to ask, but some guidance would be a ton of help! I've programmed all the statistics bits in using R, but now I just need to get my files in the right format!

Thanks!

user1357015 asked Apr 25 '12 22:04




1 Answer

I am mildly surprised that no one mentioned the Bio.PDB package from BioPython. Writing a PDB parser on your own is a rather serious case of unnecessarily reinventing, I mean reimplementing the wheel.

BioPython is a useful collection of packages for working with other kinds of biological data (e.g. protein or nucleic acid sequences) as well.
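As a rough illustration of what that could look like for the question's requirements, here is a minimal sketch using `Bio.PDB.PDBParser`. The function name `ca_coordinates` and its default `structure_id` are invented for this example, and it assumes Biopython is installed. It only looks at the first model, skips hetero residues and negatively numbered residues, and leaves the choice among alternate locations to Biopython's default (as far as I know, the location with the highest occupancy).

```python
def ca_coordinates(pdb_path, structure_id="protein"):
    """Return (resname, chain, resseq, x, y, z) for each C-alpha atom."""
    # Third-party dependency (pip install biopython), imported here so the
    # requirement is visible right where it is used.
    from Bio.PDB import PDBParser

    parser = PDBParser(QUIET=True)  # QUIET suppresses parser warnings
    structure = parser.get_structure(structure_id, pdb_path)
    model = structure[0]  # first model only

    rows = []
    for chain in model:
        for residue in chain:
            # residue.id is (hetero flag, residue number, insertion code)
            hetfield, resseq, _icode = residue.id
            if hetfield != " " or resseq < 0 or "CA" not in residue:
                continue
            # For a disordered atom this is the currently selected location.
            x, y, z = (float(v) for v in residue["CA"].coord)
            rows.append((residue.get_resname(), chain.id, resseq, x, y, z))
    return rows
```

Calling `ca_coordinates("3GFT.pdb")` on a downloaded copy of the file would then give one tuple per residue that can be written out in whatever format R expects.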

I am aware that this was an old question but maybe someone still finds this pointer helpful.

Laryx Decidua answered Sep 28 '22 07:09