I have 5 PDF files, each of which has links to different pages in another PDF file. The files are tables of contents for large PDFs (~1000 pages each), so manual extraction is possible but very painful. So far I have opened a file in Acrobat Pro, where I can right-click each link and see what page it points to, but I need to extract all the links in some programmatic manner. I'm not opposed to doing a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried exporting the PDF from Acrobat Pro as HTML and as Word, but neither method preserved the links.
I'm at my wits' end, and any help would be great. I'm comfortable working with Python, or a range of other languages.
Looking for URIs using pyPdf (Python 2):
import pyPdf

f = open('TMR-Issue6.pdf', 'rb')
pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()

key = '/Annots'   # page annotations (link annotations live here)
uri = '/URI'      # URI action target
ank = '/A'        # the annotation's action dictionary

for pg in range(pgs):
    p = pdf.getPage(pg)
    o = p.getObject()
    if key in o:
        for a in o[key]:
            u = a.getObject()
            # not every annotation carries an action, and not every
            # action is a URI action, so guard both lookups
            if ank in u and uri in u[ank]:
                print u[ank][uri]
gives,
http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html
etc...
I couldn't find a file that had links to another PDF, but I suspect that for such links the URI field would contain URIs of the form file:///myfiles
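It's also worth noting that links into another PDF don't have to be URI actions at all: per the PDF spec, a link's action dictionary has a subtype `/S`, and remote go-to actions (`/S` = `/GoToR`, with a `/F` file spec and a `/D` destination) are the usual way one PDF points at a page in another. A minimal sketch of branching on the subtype, using hand-built plain dicts to stand in for the dict-like objects pyPdf returns (the function name and sample values are illustrative, not pyPdf API):

```python
# Hypothetical helper: classify a PDF link action by its '/S' subtype.
# The sample dicts below imitate annotation '/A' entries; real pyPdf
# objects support the same key lookups.

def describe_link_action(action):
    """Return a tuple describing where a link action points."""
    subtype = action.get('/S')
    if subtype == '/URI':
        # plain web/file URI
        return ('uri', action['/URI'])
    if subtype == '/GoToR':
        # remote go-to: '/F' names the target file, '/D' the destination
        # (often a [page, fit-style] array or a named destination string)
        return ('remote', action.get('/F'), action.get('/D'))
    if subtype == '/GoTo':
        # destination within the same document
        return ('internal', action.get('/D'))
    return ('other', subtype)

print(describe_link_action({'/S': '/URI', '/URI': 'http://example.com'}))
# -> ('uri', 'http://example.com')
print(describe_link_action({'/S': '/GoToR', '/F': 'big.pdf', '/D': [41, '/Fit']}))
# -> ('remote', 'big.pdf', [41, '/Fit'])
```

So if the loop above prints nothing for your table-of-contents files, try dumping `u[ank]` itself and checking its `/S` entry; `/GoToR` actions would carry both the target file name and the page destination you're after.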