Structure of a PDF file? [closed]

Tags:

pdf

For a small project I have to parse PDF files and extract a specific part of them (a simple string of characters). I'd like to use Python to do this, and I've found several libraries that can do what I want in some ways.

But now, after some research, I'm wondering what the real structure of a PDF file is. Does anyone know if there is a spec or some explanation available online? I found a link on Adobe's site, but it seems to be dead :(

406

asked Sep 17 '08 23:09

Valentin Jacquemin

2 Answers

Here is a link to Adobe's reference material

http://www.adobe.com/devnet/pdf/pdf_reference.html

You should know, though, that PDF is only about presentation, not structure: it describes how each page looks, not how the content is logically organized. Parsing will not come easy.
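To make the "presentation, not structure" point concrete, here is a minimal sketch in Python. It assumes a hand-written, uncompressed page content stream (the `CONTENT_STREAM` bytes below are made up for illustration), and the helper name `shown_strings` is hypothetical. In a content stream, text is drawn by operators such as `Tf` (select font), `Td` (move position) and `Tj` (show string); there is no notion of a paragraph or field to look up.

```python
import re

# Hand-written, uncompressed content stream (illustrative, not taken from a real file).
# BT/ET delimit a text object; Tf selects a font, Td moves the position, Tj shows a string.
CONTENT_STREAM = b"""BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# Matches a literal string operand followed by the Tj operator.
# Simplifying assumption: no escaped or nested parentheses inside the string.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def shown_strings(stream: bytes) -> list:
    """Return the literal strings drawn with Tj, in the order they appear."""
    return [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(stream)]


print(shown_strings(CONTENT_STREAM))  # ['Hello', 'World']
```

Real files are usually harder than this: streams are typically compressed, strings may be hex-encoded or mapped through a font-specific encoding, and one visible word can be split across several drawing operations. That is why a dedicated library is usually a better starting point than regular expressions.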

187

answered Sep 27 '22 20:09

minty

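I found the GNU Introduction to PDF helpful in understanding the structure. It includes an easily readable example PDF file that is described in complete detail.

Other helpful links:

PDF Succinctly is a longer book with helpful pictures.

Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.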
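As a rough illustration of the overall layout those introductions describe, the sketch below builds a tiny PDF by hand and then locates its main parts: the `%PDF-x.y` header, the numbered objects (`n g obj ... endobj`), the cross-reference table (`xref`), the `trailer`, and the `startxref` pointer before `%%EOF`. The function names `build_minimal_pdf` and `describe_layout` are hypothetical, and the sketch assumes the classic cross-reference table; newer files may use cross-reference streams and object streams instead, which this code does not handle.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF by hand, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                      # header
    offsets = []
    for number, body in enumerate(objects, start=1):    # body: numbered objects
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)         # cross-reference table
    out += b"0000000000 65535 f \n"                     # entry 0 is always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset             # 20-byte entry per object
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset    # pointer back to the table
    return bytes(out)


def describe_layout(data: bytes) -> dict:
    """Locate the header version, the xref table, and the object numbers."""
    version = re.match(rb"%PDF-(\d\.\d)", data).group(1).decode("ascii")
    # A reader starts at the end of the file: startxref gives the table's byte offset.
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    xref_offset = int(tail.group(1))
    numbers = [int(m.group(1)) for m in re.finditer(rb"(?m)^(\d+) (\d+) obj\b", data)]
    return {
        "version": version,
        "xref_offset": xref_offset,
        "xref_found": data[xref_offset:xref_offset + 4] == b"xref",
        "objects": numbers,
    }


pdf = build_minimal_pdf()
print(describe_layout(pdf))
```

Scanning for `obj` with a regular expression is only a shortcut for this toy file. A real reader follows `startxref` to the table and uses the recorded offsets to jump to each object.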

Jeff Moser
