I need to programmatically analyze and combine several hundred PDF documents, and link the pages together in specialized ways. Each PDF includes text in each location where a link belongs, indicating what it should link to. I'm using pdfminer to extract the location and text where the links should be; now I just need to actually create those links.
I've done some research and concluded that PyPDF2 can supposedly do this. At any rate, there's a seemingly straightforward addLink method that claims to get the job done. I just can't get it to work.
from PyPDF2 import PdfFileWriter
from PyPDF2.pdf import RectangleObject
out = PdfFileWriter()
out.insertBlankPage(800, 1000)
out.insertBlankPage(800, 1000)
# rect = [400, 400, 600, 600] # This doesn't seem to work either
rect = RectangleObject([400, 400, 600, 600])
out.addLink(0, 1, rect) # link from first to second page
with open(r'C:\temp\test.pdf', 'wb') as outf:
    out.write(outf)
The code above produces a beautiful two-page PDF with nothing in it, at least as far as I can tell. Does anyone out there know how this might be accomplished? Or at least an indication of where I'm going wrong?
A solution doesn't have to use PyPDF2, as long as the library is freely licensed. Strictly speaking, Python isn't even a requirement, but it would be nice to fit this into my current structure without hacking another language onto it.
This appears to be a bug in the implementation of addLink, or possibly that method is simply meant for an older or different link syntax. In any event, inspecting the structure of the output PDF from the example code in the question reveals this little gem:
6 0 obj
<<
/Dest [ 4 0 R /FitV 826 ]
/Type /Annot
/Rect RectangleObject([400, 400, 600, 600])
/Border [ 0 0 0 ]
/P IndirectObject(5, 0)
/Subtype /Link
>>
There are several problems with this. Most obvious is that RectangleObject and IndirectObject are constructs of the Python library, not valid PDF structures. /Dest also seems to have a mysterious magic parameter on it that I didn't ask for. Further, /P would be redundant (a reference to the page that contains this link), even if it were implemented in a way that didn't slap Python objects into the PDF structure. So in short, it's no wonder that this link is broken.
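A quick way to spot this kind of leakage in a dumped object is to search it for Python-style constructor text. The sketch below is only a heuristic: find_leaked_reprs and LEAK_PATTERN are hypothetical names, not part of PyPDF2, and the pattern assumes the leaked class names end in "Object", as the two seen above do.

```python
import re

# Heuristic: look for Python constructor reprs such as RectangleObject([...])
# or IndirectObject(5, 0). The "...Object(" shape is an assumption based on
# the two leaks seen in the dump, not an exhaustive rule.
LEAK_PATTERN = re.compile(r"\b([A-Z][A-Za-z]*Object)\(")


def find_leaked_reprs(pdf_object_text):
    """Return the sorted, de-duplicated class names whose repr appears in the text."""
    return sorted(set(LEAK_PATTERN.findall(pdf_object_text)))


broken = """6 0 obj
<<
/Dest [ 4 0 R /FitV 826 ]
/Type /Annot
/Rect RectangleObject([400, 400, 600, 600])
/Border [ 0 0 0 ]
/P IndirectObject(5, 0)
/Subtype /Link
>>"""

print(find_leaked_reprs(broken))
```

Run against the broken object from the dump, this reports both IndirectObject and RectangleObject; a clean object should yield an empty list.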
Messing around with the source a bit to eliminate the crashing errors, it turns out that two changes are needed* to get the link into working order: changing the internal representation of the /Rect from a NameObject to an ArrayObject, and changing the /P reference to point at the page number, rather than the actual object. These changes let the example code produce valid output:
6 0 obj
<<
/Dest [ 4 0 R /FitV ]
/Type /Annot
/Rect [ 400 400 600 600 ]
/Border [ 0 0 0 ]
/P 0
/Subtype /Link
>>
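For comparison, here is a minimal sketch that writes the same kind of link annotation by hand, purely to show what the dictionary should look like once serialized. The helper link_annotation is hypothetical and not a PyPDF2 function; it assumes the destination page is object generation 0, copies the /Dest form from the output above, and leaves out /P, which was noted earlier as redundant.

```python
def link_annotation(dest_obj_num, rect, fit="/FitV"):
    """Render a /Link annotation dictionary as PDF syntax (hypothetical helper).

    dest_obj_num: object number of the destination page (generation 0 assumed).
    rect: four numbers giving the clickable area, e.g. [400, 400, 600, 600].
    fit: destination fit mode, copied into /Dest as-is.
    """
    if len(rect) != 4:
        raise ValueError("rect must contain exactly four numbers")
    rect_text = " ".join(str(value) for value in rect)
    lines = [
        "<<",
        "/Type /Annot",
        "/Subtype /Link",
        f"/Rect [ {rect_text} ]",      # a plain PDF array, not a Python repr
        "/Border [ 0 0 0 ]",
        f"/Dest [ {dest_obj_num} 0 R {fit} ]",
        ">>",
    ]
    return "\n".join(lines)


print(link_annotation(4, [400, 400, 600, 600]))
```

The point of the sketch is only that every value ends up as a native PDF token (names, numbers, arrays, indirect references), which is what the patched library output achieves.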
Et voilà, the link works exactly as expected in the output! I also removed the magic 826 from the /Dest value, since it may not be a legal parameter depending on the zoom level, and it really shouldn't be hard-coded anyway.
*After concluding that this fix works as intended, I did figure out that leaving /Rect as a NameObject and passing it a string that looks like the output should (e.g. '[ 400 400 600 600 ]') will also work. This is presumably intended to allow maximum flexibility, but it sure is unexpected.
Update: I put together and submitted a more complete fix (link to the patch for posterity), so the issues above should all be fixed, as of version 1.22.