I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.
For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out
, but what I really need is the hyperlink itself, not the words.
How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.
I have looked at itextsharp
, but haven't used it. I'm running on Ubuntu
, and would appreciate any help.
To show or hide hyperlinks, choose View > Extras > Show Hyperlinks or Hide Hyperlinks. Note: Hyperlinks are included in exported Adobe PDF files if Hyperlinks is selected in the Export Adobe PDF dialog box in InDesign.
Common Python Libraries for PDF ScrapingPyPDF2 is a pure-python library used for PDF files handling. It enables the content extraction, PDF documents splitting into pages, document merging, cropping, and page transforming. It supports both encrypted and unencrypted documents.
slightly modified version of Ashwin's Answer:
import PyPDF2
PDFFile = open("file.pdf",'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for page in range(pages):
print("Current Page: {}".format(page))
pageSliced = PDF.getPage(page)
pageObject = pageSliced.getObject()
if key in pageObject.keys():
ann = pageObject[key]
for a in ann:
u = a.getObject()
if uri in u[ank].keys():
print(u[ank][uri])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With