Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I used the PDFMiner library and the code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract the text. However, that approach does not extract the hyperlinks.

For example, I have text that says "Check this link out", with a link attached to it. I am able to extract the words "Check this link out", but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at iTextSharp, but haven't used it. I'm running Ubuntu, and would appreciate any help.

asked Jan 02 '15 15:01

Randomly Named User

1 Answer

A slightly modified version of Ashwin's answer:

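import PyPDF2

# Dictionary keys used by PDF link annotations
key = '/Annots'  # per-page array of annotations
uri = '/URI'     # target address inside a URI action
ank = '/A'       # action dictionary of an annotation

# PyPDF2 3.0 removed PdfFileReader, getNumPages(), getPage() and getObject();
# PdfReader, .pages and get_object() are their replacements.
with open("file.pdf", 'rb') as PDFFile:
    PDF = PyPDF2.PdfReader(PDFFile)
    for page, pageSliced in enumerate(PDF.pages):
        print("Current Page: {}".format(page))
        pageObject = pageSliced.get_object()
        if key in pageObject.keys():
            ann = pageObject[key]
            for a in ann:
                u = a.get_object()
                # Not every annotation has an action, so check before indexing
                if ank in u and uri in u[ank]:
                    print(u[ank][uri])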
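
If the links are needed as data rather than printed output, the same traversal can be wrapped in a function that returns (page index, URI) pairs. The following is only a sketch: the names resolve and collect_uris are made up for illustration, and the sample pages are plain dicts with invented values standing in for parsed PDF pages so that the logic can be run without a PDF file.

```python
def resolve(obj):
    """Follow an indirect PDF reference if the object offers get_object();
    plain Python objects are returned unchanged."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def collect_uris(pages):
    """Return (page_index, uri) pairs for every annotation with a URI action."""
    links = []
    for index, page in enumerate(pages):
        if "/Annots" not in page:
            continue  # page has no annotations at all
        for annot in resolve(page["/Annots"]):
            annot = resolve(annot)
            if "/A" not in annot:
                continue  # annotation without an action dictionary
            action = resolve(annot["/A"])
            if "/URI" in action:
                links.append((index, action["/URI"]))
    return links


# Hypothetical stand-in data: plain dicts shaped like page dictionaries.
sample_pages = [
    {"/Annots": [{"/Subtype": "/Link",
                  "/A": {"/S": "/URI", "/URI": "http://example.com"}}]},
    {},                                    # a page without annotations
    {"/Annots": [{"/Subtype": "/Text"}]},  # an annotation without an action
]
print(collect_uris(sample_pages))
```

With PyPDF2 installed, passing PyPDF2.PdfReader("file.pdf").pages to collect_uris should give the same kind of list for a real document, since those page objects support the same key lookups.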

answered Sep 18 '22 16:09

Imrul Huda

Tags: python, hyperlink, pdf, pypdf, pdfminer
