Extracting links to pages in another PDF from PDF using Python or other method

Question

I have 5 PDF files, each of which has links to different pages in another PDF file. The files are each tables of contents for large PDFs (~1000 pages each), making manual extraction possible but very painful. So far I have tried opening the file in Acrobat Pro, where I can right-click each link and see what page it points to, but I need to extract all the links in some manner. I am not opposed to doing a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried to export the PDF from Acrobat Pro as HTML or Word, but neither method maintained the links.

I'm at my wits' end, and any help would be great. I'm comfortable working with Python or a range of other languages.

lafras · Accepted Answer

Looking for URIs using pyPdf,

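import pyPdf

# Open in binary mode; pyPdf reads lazily, so keep the file open while iterating
f = open('TMR-Issue6.pdf', 'rb')

pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()
key = '/Annots'  # per-page array of annotations
uri = '/URI'     # target of a URI action
ank = '/A'       # action dictionary of an annotation

for pg in range(pgs):

    p = pdf.getPage(pg)
    o = p.getObject()

    if key in o:
        ann = o[key]
        for a in ann:
            u = a.getObject()
            # Not every annotation carries an action, so check for /A first
            if ank in u and uri in u[ank]:
                print(u[ank][uri])

f.close()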

gives,

http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html

etc...

I couldn't find a file that had links to another PDF, but I suspect that the URI field should contain URIs of the form file:///myfiles.
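If the links turn out not to be URI actions, one other place to look is the remote go-to action: the PDF specification defines an action with /S set to /GoToR, a file specification under /F, and a destination under /D whose first array element is a zero-based page number. The following is only a sketch under that assumption. The helper names (resolve, describe_link, iter_links) are made up for illustration, and iter_links assumes the pypdf package (the maintained successor of pyPdf) is installed; describe_link itself works on any dict-like annotation.

```python
def resolve(obj):
    """Follow an indirect reference when the object offers one; plain values pass through."""
    for name in ("get_object", "getObject"):  # pypdf / legacy pyPdf spellings
        getter = getattr(obj, name, None)
        if callable(getter):
            return getter()
    return obj


def describe_link(annot):
    """Describe a link annotation as a dict, or return None if it is not a URI/GoToR link."""
    annot = resolve(annot)
    action = resolve(annot.get("/A"))
    if resolve(annot.get("/Subtype")) != "/Link" or action is None:
        return None
    kind = resolve(action.get("/S"))
    if kind == "/URI":
        return {"kind": "uri", "target": str(resolve(action["/URI"]))}
    if kind == "/GoToR":
        spec = resolve(action.get("/F"))
        # A file specification is either a plain string or a dictionary with its own /F entry.
        filename = resolve(spec.get("/F")) if isinstance(spec, dict) else spec
        dest = resolve(action.get("/D"))
        # Explicit remote destinations are arrays starting with a zero-based page number;
        # named destinations (strings/names) are reported with page=None.
        page = int(resolve(dest[0])) + 1 if isinstance(dest, list) and dest else None
        return {"kind": "remote",
                "file": None if filename is None else str(filename),
                "page": page}
    return None


def iter_links(path):
    """Yield (page_number, link) for every URI or GoToR link in the PDF at `path`."""
    from pypdf import PdfReader  # assumption: pypdf is installed
    for number, page in enumerate(PdfReader(path).pages, start=1):
        for annot in resolve(page.get("/Annots")) or []:
            link = describe_link(annot)
            if link is not None:
                yield number, link
```

Calling iter_links on one of the table-of-contents files and printing each result should show, per source page, either the URI or the target file and 1-based page number. Links that use a named destination would still need to be looked up in the target PDF.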

Tags:

python

pdf

Ian Bell
