For a small project I have to parse PDF files and extract a specific part of them (a simple string of characters). I'd like to use Python to do this, and I've found several libraries that can do what I want to some extent.
But after some research, I'm wondering what the actual structure of a PDF file is. Does anyone know if there is a spec or an explanation anywhere online? I found a link on Adobe's site, but it seems to be dead :(
Document structure. A PDF document consists of objects contained in the body section of a PDF file. Most of the objects in a PDF document are dictionaries. Each page of the document is represented by a page object, which is a dictionary that includes references to the page's contents.
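To make the "objects and dictionaries" idea concrete, here is a rough sketch in Python that scans a tiny hand-written PDF body for indirect objects (`N G obj ... endobj`) and reports each dictionary's `/Type` and, for a page, the object its `/Contents` entry points to. The sample bytes, the regular expressions, and the `list_objects` helper are all made up for illustration. Real files have a cross-reference table and often compressed object streams, so a regex scan like this is not a robust parser.

```python
import re

# Hand-written, simplified PDF body (no xref table or trailer); illustration only.
SAMPLE = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
4 0 obj
<< /Length 36 >>
stream
BT /F1 12 Tf 72 720 Td (Hello) Tj ET
endstream
endobj
"""

# "<object number> <generation> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\s*(.*?)\s*endobj", re.DOTALL)
# "/Type /Name" entry inside a dictionary
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")
# "/Contents N G R" is an indirect reference to another object
CONTENTS_RE = re.compile(rb"/Contents\s+(\d+) (\d+) R")


def list_objects(data):
    """Map object number -> {'type': ..., 'contents': ...} (hypothetical helper)."""
    objects = {}
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        body = match.group(3)
        type_match = TYPE_RE.search(body)
        contents_match = CONTENTS_RE.search(body)
        objects[number] = {
            "type": type_match.group(1).decode("ascii") if type_match else None,
            "contents": int(contents_match.group(1)) if contents_match else None,
        }
    return objects


if __name__ == "__main__":
    for number, info in sorted(list_objects(SAMPLE).items()):
        print(number, info)
```

In this sample, object 3 is the page dictionary and its `/Contents` entry refers to object 4, the stream that holds the drawing instructions.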
The PDF file format is now a completely free (as in air) and open format, and it is proving more popular than ever.
Here is a link to Adobe's reference material
http://www.adobe.com/devnet/pdf/pdf_reference.html
You should know though that PDF is only about presentation, not structure. Parsing will not come easy.
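To illustrate the "presentation, not structure" point: a page's content stream is a list of drawing operators, and text is shown as positioned string fragments rather than as words or paragraphs. The sketch below pulls the string operands of the `Tj` (show text) operator out of a made-up, uncompressed stream. The stream, the regex, and the `shown_strings` helper are assumptions for the example. Real streams are usually compressed, may use other text operators, and often rely on font-specific encodings, which is why a library is the practical choice.

```python
import re

# Made-up, uncompressed content stream; one word is split across two Tj calls.
STREAM = b"BT /F1 12 Tf 72 720 Td (Hel) Tj 20 0 Td (lo) Tj 0 -14 Td (world) Tj ET"

# A literal string "( ... )" followed by the Tj operator.
# Handles backslash escapes, but not nested unescaped parentheses.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(stream):
    """Return the raw string operands of every Tj operator, in stream order."""
    return [m.group(1).decode("latin-1") for m in TJ_RE.finditer(stream)]


if __name__ == "__main__":
    print(shown_strings(STREAM))
```

Nothing in the stream says that "Hel" and "lo" belong to the same word; that has to be inferred from the positioning operators, which is what makes text extraction hard.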
I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.