
Structure of a PDF file? [closed]

Tags: pdf

For a small project I have to parse PDF files and extract a specific part of them (a simple string of characters). I'd like to use Python for this, and I've found several libraries that can do what I want to some degree.

After some research, though, I'm wondering what the actual structure of a PDF file is. Does anyone know of a spec or an explanation available online? I found a link on Adobe's site, but it seems to be dead :(

Valentin Jacquemin asked Sep 17 '08 23:09



People also ask

What is the structure of a PDF file?

Document structure. A PDF document consists of objects contained in the body section of a PDF file. Most of the objects in a PDF document are dictionaries. Each page of the document is represented by a page object, which is a dictionary that includes references to the page's contents.
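To make the object model above concrete, here is a minimal sketch that scans raw bytes for indirect objects (`N G obj ... endobj`) and reports each dictionary's `/Type`. The `SAMPLE` bytes and the `list_objects` helper are made up for illustration and are not taken from any library. The regex approach assumes uncompressed, well-behaved input; real files often pack objects into compressed object streams, where this would find nothing.

```python
import re

# Hand-written, simplified PDF fragment (illustrative only, not a complete file).
SAMPLE = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
"""

# "<object number> <generation> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)
# The /Type entry of a dictionary, e.g. "/Type /Page"
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def list_objects(data):
    """Map object number -> value of its /Type entry ('?' if it has none)."""
    result = {}
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        type_match = TYPE_RE.search(match.group(3))
        result[number] = type_match.group(1).decode("ascii") if type_match else "?"
    return result


print(list_objects(SAMPLE))
```

Note how the page object (3) points to its parent and to its contents with references such as `2 0 R` and `4 0 R`; following those references is how a reader walks from the catalog down to each page.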

Is PDF a closed format?

No. PDF is a free and open format: its specification was published as the ISO 32000-1 standard in 2008, and the format is more popular than ever.


2 Answers

Here is a link to Adobe's reference material:

http://www.adobe.com/devnet/pdf/pdf_reference.html

You should know, though, that PDF describes only presentation, not logical structure, so parsing will not be easy.
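A small sketch of why that is: a page's content stream is a list of drawing operators, and text appears as fragments handed to `Tj` or `TJ`, not as words or paragraphs. The `CONTENT` bytes and the `text_fragments` helper below are hypothetical, written only for this illustration. The sketch assumes an uncompressed stream with simple literal strings; it ignores escaped parentheses, hex strings, font encodings and positioning, all of which a real extractor has to handle.

```python
import re

# Hypothetical, uncompressed page content stream. Real streams are usually
# compressed (e.g. with FlateDecode) and must be decoded first.
CONTENT = b"""BT
/F1 12 Tf
72 720 Td
(Invoice num) Tj
(ber: ) Tj
[(A) -20 (B-) 15 (1042)] TJ
ET
"""

# Either "(string) Tj" or "[(str) kerning (str) ...] TJ".
SHOW_RE = re.compile(rb"\((.*?)\)\s*Tj|\[(.*?)\]\s*TJ", re.DOTALL)
STRING_RE = re.compile(rb"\((.*?)\)")


def text_fragments(stream):
    """Return the text shown by each Tj/TJ operator, in stream order."""
    fragments = []
    for match in SHOW_RE.finditer(stream):
        if match.group(1) is not None:
            # Tj: a single literal string
            fragments.append(match.group(1).decode("latin-1"))
        else:
            # TJ: an array mixing strings and kerning numbers; keep the strings
            parts = STRING_RE.findall(match.group(2))
            fragments.append(b"".join(parts).decode("latin-1"))
    return fragments


print(text_fragments(CONTENT))
print("".join(text_fragments(CONTENT)))
```

Even in this toy case the word "number" is split across two operators, and nothing in the file says the three fragments belong to one line. That reassembly work is what the parsing libraries do for you.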

minty answered Sep 27 '22 20:09



I found the GNU Introduction to PDF helpful for understanding the structure. It includes an easily readable example PDF file that it describes in complete detail.
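In the same spirit, here is a rough sketch (my own, not the example from that introduction) that assembles the four parts of a classic PDF file by hand: header, body objects, cross-reference table and trailer. The function name `build_minimal_pdf` and the blank 200x200 page are assumptions chosen for illustration, and the output has not been validated against any particular viewer.

```python
def build_minimal_pdf():
    """Assemble a tiny one-page PDF: header, body, xref table, trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] /Resources << >> >>",
    ]

    out = bytearray(b"%PDF-1.4\n")  # header: format version
    offsets = []

    # Body: numbered indirect objects; remember where each one starts.
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    # Cross-reference table: one 20-byte entry per object, giving its byte
    # offset so a reader can jump to it without scanning the whole file.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # Trailer: names the root (catalog) object and points back at the xref table.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf_bytes = build_minimal_pdf()
print(pdf_bytes.decode("ascii"))
```

A reader starts at the end of the file, finds `startxref`, jumps to the cross-reference table, and from there to the catalog named by `/Root`. That is why the byte offsets have to be exact.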

Other helpful links:

  • The PDF Succinctly book is longer and has helpful pictures.
  • Introduction to the Insides of PDF is a presentation that is less in-depth but gives a quick overview and has lots of pictures.

Jeff Moser answered Sep 27 '22 20:09
