Extracting links to pages in another PDF from PDF using Python or other method

Tags: python, pdf

I have 5 PDF files, each of which has links to different pages in another PDF file. Each file is a table of contents for a large PDF (~1000 pages each), so manual extraction is possible but very painful. So far I have opened the files in Acrobat Pro, where I can right-click each link and see which page it points to, but I need a way to extract all of the links. I don't mind doing a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried exporting the PDF from Acrobat Pro as HTML and as Word, but neither export preserved the links.

I'm at my wits' end, and any help would be great. I'm comfortable working with Python or a range of other languages.

Ian Bell asked May 12 '11 04:05



1 Answer

Looking for URIs in the link annotations using pyPdf,

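import pyPdf

# Sample file used for this answer
f = open('TMR-Issue6.pdf', 'rb')

pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()
key = '/Annots'  # per-page list of annotations
uri = '/URI'     # target of a URI action
ank = '/A'       # action dictionary attached to a link annotation

for pg in range(pgs):

    p = pdf.getPage(pg)
    o = p.getObject()

    # Skip pages that have no annotations
    if key in o:
        ann = o[key]
        for a in ann:
            u = a.getObject()
            # Not every annotation carries an action, so test for /A first
            if ank in u and uri in u[ank]:
                print(u[ank][uri])

f.close()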

gives,

http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html

etc...

I couldn't find a file that had links to another PDF, but I suspect that the URI field should contain URIs of the form file:///myfiles.
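
If the /URI field turns out to be empty for such links, it may be worth checking the action type as well. In the PDF specification, a link to a page in a different PDF is normally a remote go-to action: the /A dictionary has /S set to /GoToR, /F names the target file, and /D is the destination, whose first array element is a zero-based page number. The function below is only a sketch under that assumption and has not been run against the asker's files. describe_link_action and big.pdf are hypothetical names, and the function expects a dictionary-like /A object such as u[ank] from the loop above.

```python
def describe_link_action(action):
    """Classify a link annotation's /A action dictionary.

    Returns a (kind, target, page) tuple. page is the zero-based page
    number for a remote go-to action with an explicit destination,
    and None in every other case.
    """
    if '/S' not in action:
        return ('unknown', None, None)
    kind = action['/S']

    if kind == '/URI':
        target = action['/URI'] if '/URI' in action else None
        return ('uri', target, None)

    if kind == '/GoToR':
        target = action['/F'] if '/F' in action else None
        # /F can be a plain string or a file specification dictionary
        # that holds the file name under its own /F key.
        if isinstance(target, dict):
            target = target['/F'] if '/F' in target else None
        dest = action['/D'] if '/D' in action else None
        # Explicit destinations look like [page, /Fit, ...]; named
        # destinations are strings and carry no page number here.
        page = dest[0] if isinstance(dest, list) and dest else None
        return ('remote', target, page)

    return ('other', None, None)


if __name__ == '__main__':
    # Hand-built action dictionary standing in for u['/A'].
    sample = {'/S': '/GoToR', '/F': 'big.pdf', '/D': [41, '/Fit']}
    print(describe_link_action(sample))
```

In the loop above, calling describe_link_action(u[ank]) in place of the /URI test would report both kinds of link. Add 1 to the page value to get the page number a viewer displays.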

lafras answered Oct 30 '22 08:10
