Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract the text. However, it does not extract the hyperlinks.

For example, I have text that says "Check this link out", with a link attached to it. I am able to extract the words "Check this link out", but what I really need is the URL the link points to, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at iTextSharp, but haven't used it. I'm running Ubuntu, and would appreciate any help.

Randomly Named User asked Jan 02 '15 15:01

1 Answer

A slightly modified version of Ashwin's answer:

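import PyPDF2

# Names below are the PyPDF2 2.x/3.x spellings; older releases used
# PdfFileReader, getNumPages(), getPage() and getObject() instead.
key = '/Annots'
uri = '/URI'
ank = '/A'

with open("file.pdf", 'rb') as pdf_file:  # the file is closed automatically
    pdf = PyPDF2.PdfReader(pdf_file)
    for number, page in enumerate(pdf.pages):
        print("Current Page: {}".format(number))
        if key not in page:  # this page has no annotations
            continue
        for a in page[key]:
            u = a.get_object()  # resolve the indirect reference
            # Not every annotation carries an action, so check for /A first.
            if ank in u and uri in u[ank]:
                print(u[ank][uri])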
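
If the lookup needs to be reused, or tested without a PDF on hand, the URI extraction can be separated from the file handling. The following is a minimal sketch rather than part of the original answer: the helper names `resolve`, `iter_uris` and `extract_links` are made up for illustration, and `extract_links` assumes the third-party pypdf package (the successor to PyPDF2) is installed.

```python
from typing import Any, Iterable, Iterator, List, Mapping, Tuple


def resolve(obj: Any) -> Any:
    """Dereference an indirect PDF object if it offers get_object(); else return it as-is."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_uris(annotations: Iterable[Any]) -> Iterator[str]:
    """Yield the /URI target of each annotation whose /A action has one."""
    for annot in annotations:
        annot = resolve(annot)
        if not isinstance(annot, Mapping):
            continue  # skip anything that is not a dictionary-like annotation
        action = resolve(annot.get("/A"))
        if isinstance(action, Mapping) and "/URI" in action:
            target = resolve(action["/URI"])
            if isinstance(target, bytes):
                yield target.decode("utf-8", "replace")
            else:
                yield str(target)


def extract_links(path: str) -> List[Tuple[int, str]]:
    """Return (page_number, uri) pairs, with pages counted from 1.

    Assumption: pypdf is installed. It is imported here, not at module
    level, so the helpers above stay usable without it.
    """
    from pypdf import PdfReader

    links: List[Tuple[int, str]] = []
    for number, page in enumerate(PdfReader(path).pages, start=1):
        annots = resolve(page.get("/Annots"))
        if isinstance(annots, list):  # pages without annotations are skipped
            links.extend((number, uri) for uri in iter_uris(annots))
    return links
```

Calling `extract_links("file.pdf")` would give one pair per external link. Annotations whose action is not a URI (internal jumps, for example) are skipped rather than raising a KeyError.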
Imrul Huda answered Sep 18 '22 16:09