
Export PDF page labels on command line

I'd like to export the page labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after converting it with qpdf, but this seems like overkill.

Is there no command-line tool that will simply print the page label for each page (alone or together with other metadata)? I know that PDFSpy will export the label, but $300 isn't an option; preferably the solution should be free.

grovel asked Oct 16 '12 21:10



2 Answers

I've written a small command-line utility based on Poppler that does just this task: https://github.com/HeimMatthias/pdfpagelabels

Disclaimer: I'm the OP and created the original post under a different account. For years I successfully used the pdftk-based solution (listed in a comment above) in my implementation. However, last year it was time to reimplement our system from scratch, and we had numerous instances where the pdftk output could not be parsed by our implementation.

The new command-line tool follows the philosophy of doing just one thing, but doing it well: it simply prints the page labels of all or selected pages of a PDF file. If anyone stumbles upon it here and finds it useful, all the better.

mheim answered Nov 03 '22 07:11



Short answer:
I am not aware of any (free) tool that can 'simply print' the page label for each page.

Also, you won't be able to avoid expanding compressed objects and object streams, using a tool like qpdf or one with equivalent capabilities.

Long answer:
There's no such tool because there are only a few things you can safely rely on when it comes to page labels. These are the following:

  1. Each PDF document must contain a root object.
  2. That root object must be of /Type /Catalog.
  3. The document's trailer shows where to find that object: the key /Root is followed by an indirect reference to it (object number, generation number and the keyword R).
  4. IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.

This is where it stops being relatively easy, because the object the /PageLabels key refers to may be contained in a compressed object stream. This means that you'd have to expand that object stream.
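The four points above can be sketched in a few lines of Python. This is only a rough illustration under strong assumptions: the file has already been expanded (for example with qpdf, as mentioned in the question) so that the catalog is plain text, and it ends with a classic `trailer` keyword rather than a cross-reference stream. The function names and the `sample` bytes are made up for this sketch; a plain regex search like this is not a real PDF parser.

```python
import re


def find_root_reference(pdf_bytes):
    """Return (object_number, generation) from the trailer's /Root key, or None."""
    # Incremental updates append new trailers, so look at the last one.
    trailer_pos = pdf_bytes.rfind(b"trailer")
    if trailer_pos == -1:
        return None
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf_bytes[trailer_pos:])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def catalog_has_page_labels(pdf_bytes, root_ref):
    """Check whether the (uncompressed) root object mentions /PageLabels."""
    obj_num, gen = root_ref
    header = re.search(rb"(?<!\d)%d\s+%d\s+obj\b" % (obj_num, gen), pdf_bytes)
    if header is None:
        return False
    end = pdf_bytes.find(b"endobj", header.end())
    body = pdf_bytes[header.end(): end if end != -1 else len(pdf_bytes)]
    return b"/PageLabels" in body


# Hypothetical, heavily shortened file content used only for illustration.
sample = (
    b"%PDF-1.7\n"
    b"213 0 obj\n<< /Type /Catalog /PageLabels << /Nums [ 0 << /S /r >> ] >> >>\nendobj\n"
    b"trailer\n<< /Size 214 /Root 213 0 R >>\nstartxref\n0\n%%EOF\n"
)

root = find_root_reference(sample)
has_labels = catalog_has_page_labels(sample, root)
```

On a file that still uses object streams, the catalog may not appear as plain text at all, so the second function would not find it.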

If you do succeed in getting the description of the page labels as ASCII, you'll discover that it's not an easily parseable flat list (like a dictionary is): it is a number tree.
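To give a feel for what "number tree" means: a node holds either a /Nums array of alternating keys and values, or a /Kids array of child nodes that in turn hold more of the same. The sketch below flattens such a tree into a sorted list of (page index, label dictionary) pairs. It assumes the tree has already been parsed into nested Python dicts and lists with all indirect references resolved; that in-memory shape, the function name and the example data are assumptions made for illustration only.

```python
def flatten_number_tree(node):
    """Return all (key, value) pairs of a number tree, sorted by key."""
    pairs = []
    # /Nums is a flat array: key1 value1 key2 value2 ...
    nums = node.get("Nums", [])
    pairs.extend(zip(nums[0::2], nums[1::2]))
    # /Kids lists child nodes, each of which is again a number tree node.
    for kid in node.get("Kids", []):
        pairs.extend(flatten_number_tree(kid))
    return sorted(pairs, key=lambda pair: pair[0])


# Hypothetical tree: the root only has /Kids, the leaves carry /Nums.
example_tree = {
    "Kids": [
        {"Limits": [0, 7], "Nums": [0, {"S": "r"}, 7, {"S": "D"}]},
        {"Limits": [11, 11], "Nums": [11, {"S": "D", "P": "ABCD-", "St": 3}]},
    ]
}

flat = flatten_number_tree(example_tree)
```

In simple documents the root node carries /Nums directly, in which case the recursion never descends.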

I won't go into the details of these complexities, because it would take a very long article to describe all possible variations. You had better read up on them directly in the official ISO PDF-1.7 specification.

Instead, I'll give you an example in ASCII PDF code:

213 0 obj
  << /Type /Catalog
     /PageLabels 
        << 
           /Nums 
                 [ 
                   0 <<           % start labeling from page no. 1
                       /S /r      % label with lowercase roman numbers
                     >> 
                   7 <<           % start new labeling from page no. 8
                       /S /D      % label with standard decimal numbers
                     >> 
                   11 <<          % start labeling page no. 12
                       /S /D      % label with decimal numbers...
                       /P (ABCD-) %   ...but using label prefix 'ABCD-'...
                       /St 3      %   ...followed by '3' as the start decimal.
                     >>
                  ]
        >>
     %%...........................
     %%...more root object keys...
     %%........................... 
  >>
endobj

The above example labels the pages number 1, 2, 3, ... (last) like this:

i
ii
iii
iv
v
vi
vii
1
2
3
4
ABCD-3
ABCD-4
ABCD-5
ABCD-6
...and so on until last page...

As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.
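For readers who prefer code to prose, here is a minimal Python sketch of how the ranges in the example translate into labels. It assumes the /Nums entries have already been extracted into (start index, dictionary) pairs; the function names and that data shape are made up for this illustration. The style codes handled are D (decimal), R and r (roman), and A and a (letters); a range without /S gets only its prefix.

```python
def to_roman(n):
    """Convert a positive integer to an uppercase roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(style, n):
    """Render the numeric part of a label for the given /S style."""
    if style is None:
        return ""  # no /S entry: the label consists of the prefix only
    if style == "D":
        return str(n)
    if style in ("R", "r"):
        roman = to_roman(n)
        return roman if style == "R" else roman.lower()
    if style in ("A", "a"):
        letters = to_letters(n)
        return letters if style == "A" else letters.lower()
    raise ValueError("unknown numbering style: %r" % style)


def page_labels(nums, page_count):
    """nums: (start_index, range_dict) pairs sorted by index, first index 0."""
    labels = []
    for i, (start, rng) in enumerate(nums):
        # A range ends where the next one starts, or at the last page.
        end = nums[i + 1][0] if i + 1 < len(nums) else page_count
        first = rng.get("St", 1)    # /St defaults to 1
        prefix = rng.get("P", "")   # /P is optional
        for index in range(start, min(end, page_count)):
            labels.append(prefix + format_number(rng.get("S"), first + index - start))
    return labels


# The three ranges from the example above, applied to a 15-page document.
example = [(0, {"S": "r"}), (7, {"S": "D"}), (11, {"S": "D", "P": "ABCD-", "St": 3})]
labels = page_labels(example, 15)
```

Note that the keys in /Nums are zero-based page indices, which is why key 7 starts a new range on the eighth page.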

Kurt Pfeifle answered Nov 03 '22 05:11
