How to extract all links from a PDF file?

Tags: python, pdf, pypdf

According to the standard, links are stored in annotations (section 12.5.6.5 of the specification). It is easy to extract the address from there; see "Extracting links to pages in another PDF from PDF using Python or other method". But very often links appear not as special objects in the document but as plain text such as "http://blah-blah.com". How do I extract not only the links from annotations but also the links from the text itself? I can search through the whole text for strings like "http://", but is there a better solution? PDF editors highlight text links too; how do they know that a piece of text is a hyperlink?

312

asked Jul 15 '15 16:07

m9_psy

2 Answers

I've just made pdfx, a small tool for exactly this job: finding the URLs in a given PDF and downloading all the PDFs it references. It's written in Python and released as open source under the GPLv3 license: http://www.metachris.com/pdfx

You can use pdfx to show all URLs that point to PDFs, show all URLs (with -v), and download all referenced PDFs (with -d):

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Reading url 'https://weakdh.org/imperfect-forward-secrecy.pdf'...
Saved pdf as './imperfect-forward-secrecy.pdf'
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- Pages = 13

Analyzing text...
- URLs: 49
- URLs to PDFs: 17

JSON summary saved as './imperfect-forward-secrecy.pdf.infos.json'

Downloading 17 referenced pdfs...
Created directory './imperfect-forward-secrecy.pdf-referenced-pdfs'
Downloaded 'http://cr.yp.to/factorization/smoothparts-20040510.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/smoothparts-20040510.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35517.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35517.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35514.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35514.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35519.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35519.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35522.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35522.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35509.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35509.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35528.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35528.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35513.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35513.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35533.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35533.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35551.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35551.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35527.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35527.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35520.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35520.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35526.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35526.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35515.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35515.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35529.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35529.pdf'...
Downloaded 'http://cryptome.org/2013/08/spy-budget-fy13.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/spy-budget-fy13.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35671.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35671.pdf'...

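You can install it with $ easy_install -U pdfx (on Python setups where easy_install is no longer available, $ pip install -U pdfx installs the same package).

Under the hood, pdfx uses PyPDF2, a Python library, to read the PDF content, and then a regular expression to match all URLs.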
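The regular-expression step can be sketched roughly as follows. This is a minimal illustration, not pdfx's actual code: the pattern, the names URL_PATTERN and find_urls, and the trailing-punctuation cleanup are all assumptions made for the example, and the input is assumed to be plain text already extracted from the PDF by a library such as PyPDF2.

```python
import re

# Hypothetical pattern (not the one pdfx ships): matches http(s)/ftp URLs
# and bare "www." hosts, stopping at whitespace, brackets, and quotes.
URL_PATTERN = re.compile(
    r'(?:https?://|ftp://|www\.)[^\s<>()\[\]{}"]+', re.IGNORECASE
)

# Characters that usually belong to the surrounding sentence, not the URL.
TRAILING_PUNCTUATION = ".,;:!?"


def find_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = "See http://blah-blah.com, or (https://example.org/a.pdf)."
    for url in find_urls(sample):
        print(url)
```

Stopping at parentheses keeps "(https://example.org/a.pdf)." clean but truncates URLs that legitimately contain parentheses, so the pattern may need adjusting for your documents. URLs that are broken across lines in the extracted text will also be cut short.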

110

answered Oct 01 '22 09:10

Chris Hager

strings "somePDFfile.pdf" | grep http

This works even better if you use pdftk to uncompress it first (credits: Ben Stern):

pdftk in.pdf cat output out.pdf uncompress; strings out.pdf | grep -i http
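If strings and grep are not available, the same idea can be approximated in Python. This is a rough sketch under stated assumptions: the function name http_strings is made up for the example, and "printable" is taken to mean runs of at least four ASCII bytes in the range 0x20 to 0x7e, which is close to but not identical to what strings reports. Like the shell version, it only sees URLs that are stored uncompressed, so running pdftk's uncompress step first still helps.

```python
import re
import sys

# Runs of at least four printable ASCII bytes, roughly what `strings` reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def http_strings(data):
    """Return printable runs in data that contain 'http' (case-insensitive)."""
    runs = (m.group(0).decode("ascii") for m in PRINTABLE_RUN.finditer(data))
    return [run for run in runs if "http" in run.lower()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py out.pdf
    with open(sys.argv[1], "rb") as handle:
        for line in http_strings(handle.read()):
            print(line)
```

The output contains whole printable runs (for example "/URI (http://...)"), not bare URLs, so a URL-matching regular expression is still needed to isolate the addresses.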

answered Oct 02 '22 09:10

ibisum
