
Add in-document link to PDF

I need to programmatically analyze and combine several hundred PDF documents and link their pages together in specialized ways. Each PDF includes text at each location where a link belongs, indicating what it should link to. I'm using pdfminer to extract the location and text of each intended link; now I just need to actually create those links.

I've done some research and concluded that PyPDF2 can supposedly do this. At any rate, there's a seemingly-straightforward addLink method that claims to get the job done. I just can't get it to work.

from PyPDF2 import PdfFileWriter
from PyPDF2.pdf import RectangleObject

out = PdfFileWriter()

out.insertBlankPage(800, 1000)
out.insertBlankPage(800, 1000)

# rect = [400, 400, 600, 600] # This doesn't seem to work either
rect = RectangleObject([400, 400, 600, 600])
out.addLink(0, 1, rect) # link from first to second page

with open(r'C:\temp\test.pdf', 'wb') as outf:
    out.write(outf)

The code above produces a beautiful two-page PDF with nothing in it, at least as far as I can tell. Does anyone out there know how this might be accomplished? Or at least an indication of where I'm going wrong?

A solution doesn't have to use PyPDF2, as long as the library is freely licensed. Strictly speaking, Python isn't even a requirement, but it would be nice to fit this into my current structure without hacking another language onto it.

Henry Keiter asked May 23 '14 16:05



1 Answer

This appears to be a bug in the implementation of addLink, or possibly that method is simply meant for an older or different link syntax. In any event, inspecting the structure of the output PDF from the example code in the question reveals this little gem:

6 0 obj
<<
/Dest [ 4 0 R /FitV 826 ]
/Type /Annot
/Rect RectangleObject([400, 400, 600, 600])
/Border [ 0 0 0 ]
/P IndirectObject(5, 0)
/Subtype /Link
>>

There are several problems with this. Most obvious is that RectangleObject and IndirectObject are constructs of the Python library, not valid PDF structures. /Dest also seems to have a mysterious magic parameter on it that I didn't ask for. Further, /P would be redundant (a reference to the page that contains this link), even if it were implemented in a way that didn't slap Python objects into the PDF structure. So in short, it's no wonder that this link is broken.
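
As a rough illustration of the diagnosis above, here is a minimal sketch that scans a textual dump of PDF objects for Python reprs that leaked into the output. The helper name find_leaked_reprs is hypothetical (it is not part of PyPDF2), and the two class names it looks for are simply the ones seen in the broken object; it assumes the object text is available as an uncompressed string.

```python
import re

# Python reprs that should never appear in serialized PDF syntax.
# The two class names are taken from the broken output above;
# the helper itself is hypothetical, not a PyPDF2 function.
_LEAKED_REPR = re.compile(r"\b(RectangleObject|IndirectObject)\(")


def find_leaked_reprs(pdf_text):
    """Return (line_number, line) pairs that contain a leaked Python repr."""
    hits = []
    for number, line in enumerate(pdf_text.splitlines(), start=1):
        if _LEAKED_REPR.search(line):
            hits.append((number, line.strip()))
    return hits


# The broken annotation object from the example output.
broken = """6 0 obj
<<
/Dest [ 4 0 R /FitV 826 ]
/Type /Annot
/Rect RectangleObject([400, 400, 600, 600])
/Border [ 0 0 0 ]
/P IndirectObject(5, 0)
/Subtype /Link
>>"""

for number, line in find_leaked_reprs(broken):
    print(number, line)
```

Run against the object above, this should flag the /Rect and /P entries and nothing else.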

After messing around with the source a bit to eliminate the crashing errors, I found that two changes are needed* to get the link into working order: changing the internal representation of the /Rect from a NameObject to an ArrayObject, and changing the /P reference to point at the page number rather than the actual object. These changes let the example code produce valid output:

6 0 obj
<<
/Dest [ 4 0 R /FitV ]
/Type /Annot
/Rect [ 400 400 600 600 ]
/Border [ 0 0 0 ]
/P 0
/Subtype /Link
>>

Et voilà, the link works exactly as expected in the output! I also removed the magic 826 from the /Dest value, since it may not be a legal parameter depending on the zoom level, and it really shouldn't be hard-coded anyway.
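
To make the shape of a well-formed annotation concrete, here is a minimal sketch that serializes a link annotation as plain PDF syntax without any library. It is not how PyPDF2 does it; the function name format_link_annotation and its parameters are hypothetical, generation number 0 is assumed for every object, /P is left out because it is optional, and the destination uses /Fit (which takes no extra parameter) instead of the /FitV shown above.

```python
def _pdf_number(value):
    """Format a number without exponent notation or trailing zeros."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def format_link_annotation(obj_num, dest_page_obj_num, rect):
    """Serialize a /Link annotation object as PDF syntax.

    obj_num and dest_page_obj_num are PDF object numbers (generation 0
    is assumed); rect is (x1, y1, x2, y2) in page units.
    """
    if len(rect) != 4:
        raise ValueError("rect must have exactly four numbers")
    # /Rect must be a PDF array, not the repr of a Python object.
    rect_array = "[ " + " ".join(_pdf_number(v) for v in rect) + " ]"
    lines = [
        f"{obj_num} 0 obj",
        "<<",
        "/Type /Annot",
        "/Subtype /Link",
        f"/Rect {rect_array}",
        "/Border [ 0 0 0 ]",
        # The destination page is an indirect reference: "<num> <gen> R".
        f"/Dest [ {dest_page_obj_num} 0 R /Fit ]",
        ">>",
        "endobj",
    ]
    return "\n".join(lines)


print(format_link_annotation(6, 4, (400, 400, 600, 600)))
```

The point of the sketch is only that every value ends up as native PDF syntax: an array for /Rect and an indirect reference for the destination page.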


*After concluding that this fix works as intended, I did figure out that leaving /Rect as a NameObject and passing it a string that looks like the output should (e.g. '[ 400 400 600 600 ]') will also work. This is presumably intended to allow maximum flexibility, but it sure is unexpected.


Update: I put together and submitted a more complete fix (link to the patch for posterity), so the issues above should all be fixed, as of version 1.22.

Henry Keiter answered Nov 16 '22 02:11
