
Structure of a PDF file? [closed]



For a small project I have to parse PDF files and extract a specific part of them (a simple string of characters). I'd like to use Python for this, and I've found several libraries that can do what I want to some degree.

But after some research, I'm wondering what the actual structure of a PDF file is. Does anyone know of a spec or an explanation available online? I found a link on Adobe's site, but it appears to be dead :(

Valentin Jacquemin asked Sep 17 '08 23:09

People also ask

What is the structure of a PDF file?

Document structure. A PDF document consists of objects contained in the body section of a PDF file. Most of the objects in a PDF document are dictionaries. Each page of the document is represented by a page object, which is a dictionary that includes references to the page's contents.
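To make the object/dictionary description concrete, here is a minimal sketch in Python. It assumes a small hand-written fragment in the style of a PDF body (`SAMPLE_BODY`, which is not a complete or valid PDF) and uses regular expressions to list the indirect objects and find which object a page's `/Contents` entry refers to. The helper names `list_objects` and `page_contents` are made up for this illustration. Real files often store objects inside compressed object streams, so a regex scan like this is only a way to see the layout, not a reliable parser.

```python
import re

# Hand-written fragment in the style of a PDF body. It is NOT a complete PDF:
# the header, cross-reference table, and trailer are left out on purpose.
SAMPLE_BODY = b"""\
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
"""

# An indirect object looks like "<number> <generation> obj ... endobj".
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)
# A dictionary may declare its role with a /Type entry, e.g. /Type /Page.
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")
# A reference to another object looks like "<number> <generation> R".
CONTENTS_RE = re.compile(rb"/Contents\s+(\d+)\s+(\d+)\s+R")


def list_objects(data):
    """Map object number -> /Type name for every object that declares a /Type."""
    found = {}
    for number, _generation, body in OBJECT_RE.findall(data):
        type_match = TYPE_RE.search(body)
        if type_match:
            found[int(number)] = type_match.group(1).decode("ascii")
    return found


def page_contents(data):
    """Map each /Page object number -> object number its /Contents points to."""
    found = {}
    for number, _generation, body in OBJECT_RE.findall(data):
        type_match = TYPE_RE.search(body)
        if type_match and type_match.group(1) == b"Page":
            contents_match = CONTENTS_RE.search(body)
            if contents_match:
                found[int(number)] = int(contents_match.group(1))
    return found


print(list_objects(SAMPLE_BODY))
print(page_contents(SAMPLE_BODY))
```

In this fragment the catalog points to the page tree, the page tree lists its pages under `/Kids`, and the page object points to a content stream (object 4, not included here) that would hold the drawing instructions.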

Is PDF a closed format?

So the PDF file format is now a totally free (as in air) and open format, and it is proving more popular than ever.

2 Answers

Here is a link to Adobe's reference material


You should know, though, that PDF is concerned only with presentation, not with logical structure, so parsing will not come easily.
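As a rough illustration of why extraction is awkward, here is a minimal Python sketch. It assumes a hand-written, uncompressed page content stream (`SAMPLE_STREAM`) in which one visible line of text is drawn as several separate `Tj` (show text) operations; the function name `text_fragments` is made up for this example. In real files the stream is usually compressed, strings may be hex-encoded or use a font-specific encoding, and fragments are not guaranteed to appear in reading order, so this should be read as a sketch of the problem rather than a working extractor.

```python
import re

# Hand-written, uncompressed content stream for illustration only.
# BT/ET delimit a text object, Tf selects a font, Td moves the text position,
# and Tj draws a string at the current position.
SAMPLE_STREAM = b"""\
BT
/F1 12 Tf
72 720 Td
(Invoice ) Tj
40 0 Td
(No) Tj
(. 4) Tj
(2) Tj
ET
"""

# Matches a literal string operand followed by the Tj operator.
# Simplification: escaped or nested parentheses, hex strings (<...>),
# and the TJ array form are deliberately not handled.
SHOW_TEXT_RE = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def text_fragments(stream):
    """Return the string operands of Tj operators, in the order they appear."""
    return [raw.decode("latin-1") for raw in SHOW_TEXT_RE.findall(stream)]


fragments = text_fragments(SAMPLE_STREAM)
print(fragments)           # four separate pieces, not one sentence
print("".join(fragments))  # joining them works here only because they are in order
```

The file records where glyphs are painted, not where words, lines, or paragraphs begin and end, which is why libraries that extract text have to reassemble it from positions.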

minty answered Sep 27 '22 20:09


I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.

Other helpful links:

  • PDF Succinctly book is longer and has helpful pictures.
  • Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.

Jeff Moser answered Sep 27 '22 20:09